PySpark SQL Functions | array method
PySpark SQL Functions' array(~) method combines multiple columns into a single column of
arrays.
NOTE
If you want to combine multiple columns of array-type, then use the concat(~) method instead.
Parameters
1. *cols | string or Column
The columns to combine.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([['A', 'a', '1'], ['B', 'b', '2'], ['C', 'c', '3']], ['col1', 'col2', 'col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| a| 1|
| B| b| 2|
| C| c| 3|
+----+----+----+
To combine the columns col1, col2 and col3 into a single column of arrays, use
the array(~) method:
from pyspark.sql import functions as F
# Assign label to PySpark column returned by array(~) using alias(~)
df.select(F.array('col1','col2','col3').alias('combined_col')).show()
+------------+
|combined_col|
+------------+
| [A, a, 1]|
| [B, b, 2]|
| [C, c, 3]|
+------------+
Instead of passing column labels, we could also supply Column objects:
df.select(F.array(F.col('col1'),df['col2'],'col3').alias('combined_col')).show()
+------------+
|combined_col|
+------------+
| [A, a, 1]|
| [B, b, 2]|
| [C, c, 3]|
+------------+
PySpark SQL Functions | collect_set method
schedule AUG 12, 2023
PySpark SQL Functions' collect_set(~) method returns a unique set of values in a column.
Null values are ignored.
NOTE
Use collect_list(~) instead to obtain a list of values that allows for duplicates.
Parameters
1. col | string or Column object
The column label or a Column object.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
WARNING
Assume that the order of the returned set is random, since the order is affected
by shuffle operations.
Examples
Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
df = spark.createDataFrame(data, ["name", "group"])
df.show()
+-----+-----+
| name|group|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| A|
|Cathy| C|
| Dave| null|
+-----+-----+
Getting a set of column values in PySpark
To get the unique set of values in the group column:
import pyspark.sql.functions as F
df.select(F.collect_set("group")).show()
+------------------+
|collect_set(group)|
+------------------+
| [C, B, A]|
+------------------+
Equivalently, you can pass in a Column object to collect_set(~) as well:
import pyspark.sql.functions as F
df.select(F.collect_set(df.group)).show()
+------------------+
|collect_set(group)|
+------------------+
| [C, B, A]|
+------------------+
Notice how the null value does not appear in the resulting set.
Getting the set as a standard list
To get the set as a standard list:
list_rows = df.select(F.collect_set(df.group)).collect()
list_rows[0][0]
['C', 'B', 'A']
Here, the PySpark DataFrame's collect() method returns a list of Row objects. This list is
guaranteed to be length one due to the nature of collect_set(~). The Row object contains
the list so we need to include another [0].
Getting a set of column values of each group in PySpark
The method collect_set(~) is often used in the context of aggregation. Consider the same
PySpark DataFrame as before:
df.show()
+-----+-----+
| name|group|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| A|
|Cathy| C|
| Dave| null|
+-----+-----+
To flatten the group column into a single set for each name:
import pyspark.sql.functions as F
df.groupby("name").agg(F.collect_set("group")).show()
+-----+------------------+
| name|collect_set(group)|
+-----+------------------+
| Alex| [B, A]|
| Bob| [A]|
|Cathy| [C]|
+-----+------------------+
PySpark SQL Functions | concat method
PySpark SQL Functions' concat(~) method concatenates string and array columns.
Parameters
1. *cols | string or Column
The columns to concatenate.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Concatenating string-based columns in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "Wong"], ["Bob", "Marley"]], ["fname", "lname"])
df.show()
+-----+------+
|fname| lname|
+-----+------+
| Alex| Wong|
| Bob|Marley|
+-----+------+
To concatenate fname and lname:
import pyspark.sql.functions as F
df.select(F.concat("fname", "lname")).show()
+--------------------+
|concat(fname, lname)|
+--------------------+
| AlexWong|
| BobMarley|
+--------------------+
If you wanted to include a space between the two columns, you can use F.lit(" ") like so:
import pyspark.sql.functions as F
df.select(F.concat("fname", F.lit(" "), "lname")).show()
+-----------------------+
|concat(fname, , lname)|
+-----------------------+
| Alex Wong|
| Bob Marley|
+-----------------------+
F.lit(" ") is a Column object whose values are filled with " ".
You could also add an alias to the returned column like so:
df.select(F.concat("fname", "lname").alias("COMBINED NAME")).show()
+-------------+
|COMBINED NAME|
+-------------+
| AlexWong|
| BobMarley|
+-------------+
You could also pass in Column objects instead of column labels:
df.select(F.concat(df.fname, F.col("lname"))).show()
+--------------------+
|concat(fname, lname)|
+--------------------+
| AlexWong|
| BobMarley|
+--------------------+
Concatenating array-based columns in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[ [4,5], [6]], [ [7], [8,9] ]], ["A", "B"])
df.show()
+------+------+
| A| B|
+------+------+
|[4, 5]| [6]|
| [7]|[8, 9]|
+------+------+
To concatenate the arrays of each column:
import pyspark.sql.functions as F
df.select(F.concat("A", "B")).show()
+------------+
|concat(A, B)|
+------------+
| [4, 5, 6]|
| [7, 8, 9]|
+------------+
PySpark SQL Functions | count_distinct method
PySpark SQL Functions' count_distinct(~) method counts the number of distinct values in the
specified columns.
Parameters
1. *cols | string or Column
The columns in which to count the number of distinct values.
Return Value
A PySpark Column holding an integer.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A"], ["Bob", "A"], ["Cathy", "B"]], ["name", "class"])
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Bob| A|
|Cathy| B|
+-----+-----+
Counting the number of distinct values in a single column in PySpark
To count the number of distinct values in the class column:
from pyspark.sql import functions as F
df.select(F.count_distinct("class").alias("c")).show()
+---+
| c|
+---+
| 2|
+---+
Here, we are giving the name "c" to the Column returned by count_distinct(~) via alias(~).
Note that we could also supply a Column object to count_distinct(~) instead:
df.select(F.count_distinct(df["class"]).alias("c")).show()
+---+
| c|
+---+
| 2|
+---+
Obtaining an integer count
By default, count_distinct(~) returns a PySpark Column. To get an integer count instead:
df.select(F.count_distinct(df["class"])).collect()[0][0]
2
Here, we use the select(~) method to convert the Column into a PySpark DataFrame. We
then use the collect(~) method to convert the DataFrame into a list of Row objects. Since
there is only one Row in this list, as well as only one value in the Row, we use [0][0] to
access the integer count.
Counting the number of distinct values in a set of columns in PySpark
To count the number of distinct values for the columns name and class:
df.select(F.count_distinct("name", "class").alias("c")).show()
+---+
| c|
+---+
| 3|
+---+
PySpark SQL Functions | countDistinct method
PySpark SQL Functions' countDistinct(~) method returns the number of distinct rows for the
specified columns.
Parameters
1. col | string or Column
The column to consider when counting distinct rows.
2. *cols | string or Column | optional
The additional columns to consider when counting distinct rows.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Alex", 25], ["Alex", 50]], ["name",
"age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 25|
|Alex| 50|
+----+---+
Counting the number of distinct values in a single PySpark column
To count the number of distinct rows in the column name:
import pyspark.sql.functions as F
df.select(F.countDistinct("name")).show()
+--------------------+
|count(DISTINCT name)|
+--------------------+
| 2|
+--------------------+
Note that instead of passing in the column label ("name"), you can pass in a Column object
like so:
# df.select(F.countDistinct(df.name)).show()
df.select(F.countDistinct(F.col("name"))).show()
+--------------------+
|count(DISTINCT name)|
+--------------------+
| 2|
+--------------------+
Counting the number of distinct values in multiple PySpark columns
To consider the columns name and age when counting distinct rows:
df.select(F.countDistinct("name", "age")).show()
+-------------------------+
|count(DISTINCT name, age)|
+-------------------------+
| 3|
+-------------------------+
Counting the number of distinct rows in PySpark DataFrame
To consider all columns when counting distinct rows, pass in "*":
df.select(F.countDistinct("*")).show()
+-------------------------+
|count(DISTINCT name, age)|
+-------------------------+
| 3|
+-------------------------+
PySpark SQL Functions | date_add method
PySpark's date_add(~) method adds the specified number of days to a date column.
Parameters
1. start | string or Column
The column of starting dates.
2. days | int
The number of days to add.
Return Value
A pyspark.sql.column.Column object.
Examples
Basic usage
Consider the following DataFrame:
df = spark.createDataFrame([["2023-04-20"], ["2023-04-22"]], ["my_date"])
df.show()
+----------+
| my_date|
+----------+
|2023-04-20|
|2023-04-22|
+----------+
To add 5 days to our column:
from pyspark.sql import functions as F
df.select(F.date_add("my_date", 5)).show()
+--------------------+
|date_add(my_date, 5)|
+--------------------+
| 2023-04-25|
| 2023-04-27|
+--------------------+
Adding a column of days to a column of dates
Unfortunately, the date_add(~) method only accepts a constant for the second parameter.
To add a column of days to a column of dates, we must take another approach.
To demonstrate, consider the following PySpark DataFrame:
df = spark.createDataFrame([["2023-04-20", 3], ["2023-04-22", 5]], ["my_date", "my_days"])
df.show()
+----------+-------+
| my_date|my_days|
+----------+-------+
|2023-04-20| 3|
|2023-04-22| 5|
+----------+-------+
To add my_days to my_date, supply a SQL expression to the F.expr(~) method like so:
# Cast to INT first - by default, integers have type BIGINT (F.expr(~) will raise an error)
df = df.withColumn("my_days", df["my_days"].cast("int"))
df_new = df.withColumn("new_date", F.expr("date_add(my_date, my_days)"))
df_new.show()
+----------+-------+----------+
| my_date|my_days| new_date|
+----------+-------+----------+
|2023-04-20| 3|2023-04-23|
|2023-04-22| 5|2023-04-27|
+----------+-------+----------+
The resulting data type of the columns is as follows:
df_new.printSchema()
root
|-- my_date: string (nullable = true)
|-- my_days: integer (nullable = true)
|-- new_date: date (nullable = true)
Notice how even though my_date is of type string, the resulting new_date is of type date.
PySpark SQL Functions | date_format method
PySpark SQL Functions' date_format(~) method converts a date, timestamp or string into a
date string with the specified format.
Parameters
1. date | Column or string
The date column - this could be of type date, timestamp or string.
2. format | string
The format of the resulting date string.
Return Value
A Column object of date strings.
Examples
Formatting date strings in PySpark DataFrame
Consider the following PySpark DataFrame with some date strings:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1998-05-06"]], ["name",
"birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
To convert the date strings in the column birthday:
from pyspark.sql import functions as F
df.select(F.date_format("birthday", "dd/MM/yyyy").alias("birthday_new")).show()
+------------+
|birthday_new|
+------------+
| 16/12/1995|
| 06/05/1998|
+------------+
Here:
"dd/MM/yyyy" indicates a date string starting with the day, then month, then year.
alias(~) is used to give a name to the Column object returned by date_format(~).
Formatting datetime values in PySpark DataFrame
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
To convert the datetime values in column birthday:
df.select(F.date_format("birthday", "dd-MM-yyyy").alias("birthday_new")).show()
+------------+
|birthday_new|
+------------+
| 16-12-1995|
|  09-05-1995|
+------------+
Here, we are using the date format "dd-MM-yyyy", which means day first, and then month
followed by year. We also assign the column name "birthday_new" to the Column returned
by date_format().
PySpark SQL Functions | dayofmonth method
PySpark SQL Functions' dayofmonth(~) method extracts the day component of each column
value, which can be of type date or a date string.
Parameters
1. col | string or Column
The date column from which to extract the day of the month.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Extracting the day component of datetime values in PySpark DataFrame
To extract the day component of date values:
from pyspark.sql import functions as F
df.select(F.dayofmonth("birthday").alias("day")).show()
+---+
|day|
+---+
| 16|
| 9|
+---+
Here, we are assigning a label to the Column returned by dayofmonth(~) using alias(~).
Extracting the day component of date strings in PySpark DataFrame
To extract the day component of date strings:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofmonth("birthday").alias("day")).show()
+---+
|day|
+---+
| 16|
|  6|
+---+
PySpark SQL Functions | dayofweek method
PySpark SQL Functions' dayofweek(~) method extracts the day of the week of each datetime
value or date string of a PySpark column.
Parameters
1. col | string or Column
The date column from which to extract the day of the week.
Return Value
A Column of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Getting the day of the week from datetime values in PySpark DataFrame
To get the day of the week in PySpark DataFrame:
import pyspark.sql.functions as F
df.select(F.dayofweek('birthday').alias('day')).show()
+---+
|day|
+---+
| 7|
|  3|
+---+
Here:
1 represents Sunday while 7 represents Saturday.
we are using alias(~) to give a label to the Column object returned by dayofweek(~).
Getting the day of week from date strings in PySpark DataFrame
Note that the method still works even if the date is in string form:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofweek("birthday").alias("day")).show()
+---+
|day|
+---+
| 7|
| 1|
+---+
PySpark SQL Functions | dayofyear method
PySpark SQL Functions' dayofyear(~) method extracts the day of the year of each datetime
string or datetime value in a PySpark column.
Parameters
1. col | string or Column
The date column from which to extract the day of the year.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995,12,16)], ["Bob",
datetime.date(1995,5,6)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-06|
+----+----------+
Getting the day of the year of date values in PySpark DataFrame
To get the day of the year of date values:
import pyspark.sql.functions as F
df.select(F.dayofyear("birthday").alias("day")).show()
+---+
|day|
+---+
|350|
|126|
+---+
Here, we are assigning the label "day" to the Column returned by dayofyear(~).
Getting the day of the year of date strings in PySpark DataFrame
Note that the method still works even if the date is in string form:
df = spark.createDataFrame([["Alex", "1995-12-16"], ["Bob", "1990-05-06"]], ["name",
"birthday"])
df.select(F.dayofyear("birthday").alias("day")).show()
+---+
|day|
+---+
|350|
|126|
+---+
PySpark SQL Functions | element_at method
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in
a PySpark Column.
Parameters
1. col | string or Column
The column of lists or maps from which to extract values.
2. extraction | int
The position of the value that you wish to extract. Negative positioning is supported
- extraction=-1 will extract the last element from each list.
WARNING
The position is not index-based. This means that extraction=1 will extract the first value in
the lists or maps.
Return Value
A new PySpark Column.
Examples
Extracting n-th value from arrays in PySpark Column
Consider the following PySpark DataFrame that contains some lists:
rows = [[[5,6]], [[7,8]]]
df = spark.createDataFrame(rows, ['vals'])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in vals, we can use element_at(~) like so:
import pyspark.sql.functions as F
df_res = df.select(F.element_at('vals',2).alias('2nd value'))
df_res.show()
+---------+
|2nd value|
+---------+
| 6|
| 8|
+---------+
Here, note the following:
the position 2 is not index-based.
we are using the alias(~) method to assign a label to the column returned by element_at(~).
Note that extracting values that are out of bounds will return null:
df_res = df.select(F.element_at('vals',3))
df_res.show()
+-------------------+
|element_at(vals, 3)|
+-------------------+
| null|
| null|
+-------------------+
We can also extract the last element by supplying a negative value for extraction:
df_res = df.select(F.element_at('vals',-1).alias('last value'))
df_res.show()
+----------+
|last value|
+----------+
| 6|
| 8|
+----------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing some dict values:
rows = [[{'A':4}], [{'A':5, 'B':6}]]
df = spark.createDataFrame(rows, ['vals'])
df.show()
+----------------+
| vals|
+----------------+
| {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that have the key 'A' in the vals column:
df_res = df.select(F.element_at('vals', F.lit('A')))
df_res.show()
+-------------------+
|element_at(vals, A)|
+-------------------+
| 4|
| 5|
+-------------------+
Note that extracting values using keys that do not exist will return null:
df_res = df.select(F.element_at('vals', F.lit('B')))
df_res.show()
+-------------------+
|element_at(vals, B)|
+-------------------+
| null|
| 6|
+-------------------+
Here, the key 'B' does not exist in the map {'A':4} so a null was returned for that row.
PySpark SQL Functions | explode method
PySpark SQL Functions' explode(~) method flattens the specified column values of
type list or dictionary.
Parameters
1. col | string or Column
The column containing lists or dictionaries to flatten.
Return Value
A new PySpark Column.
Examples
Flattening lists
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[['a','b']],[['d']]], ['vals'])
df.show()
+------+
| vals|
+------+
|[a, b]|
| [d]|
+------+
Here, the column vals contains lists.
To flatten the lists in the column vals, use the explode(~) method:
import pyspark.sql.functions as F
df.select(F.explode('vals').alias('exploded')).show()
+--------+
|exploded|
+--------+
| a|
| b|
| d|
+--------+
Here, we are using the alias(~) method to assign a label to the column returned
by explode(~).
Flattening dictionaries
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[{'a':'b'}],[{'c':'d','e':'f'}]], ['vals'])
df.show()
+----------------+
| vals|
+----------------+
| {a -> b}|
|{e -> f, c -> d}|
+----------------+
Here, the column vals contains dictionaries.
To flatten each dictionary in column vals, use the explode(~) method:
df.select(F.explode('vals').alias('exploded_key', 'exploded_val')).show()
+------------+------------+
|exploded_key|exploded_val|
+------------+------------+
| a| b|
| e| f|
| c| d|
+------------+------------+
In the case of dictionaries, the explode(~) method returns two columns - the first column
contains all the keys while the second column contains all the values.
PySpark SQL Functions | expr method
PySpark SQL Functions' expr(~) method parses the given SQL expression.
Parameters
1. str | string
The SQL expression to parse.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([['Alex',30],['Bob',50]], ['name','age'])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 30|
| Bob| 50|
+----+---+
Using the expr method to convert column values to uppercase
The expr(~) method takes in as argument a SQL expression, so we can use SQL functions
such as upper(~):
import pyspark.sql.functions as F
df.select(F.expr('upper(name)')).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
NOTE
The expr(~) method can often be more succinctly written using PySpark
DataFrame's selectExpr(~) method. For instance, the above case can be rewritten as:
df.selectExpr('upper(name)').show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
I recommend that you use selectExpr(~) whenever possible because:
you won't have to import the SQL functions library (pyspark.sql.functions).
the syntax is shorter.
Parsing complex SQL expressions using expr method
Here's a more complex SQL expression using clauses like AND and LIKE:
df.select(F.expr('age > 40 AND name LIKE "B%"').alias('result')).show()
+------+
|result|
+------+
| false|
| true|
+------+
Note the following:
we are checking for rows where age is larger than 40 and name starts with B.
we are assigning the label 'result' to the Column returned by expr(~) using
the alias(~) method.
Practical applications of boolean masks returned by expr method
As we can see in the above example, the expr(~) method can return a boolean mask
depending on the SQL expression you supply:
df.select(F.expr('age > 40 AND name LIKE "B%"').alias('result')).show()
+------+
|result|
+------+
| false|
| true|
+------+
This allows us to check for the existence of rows that satisfy a given condition using any(~):
df.select(F.expr('any(age > 40 AND name LIKE "B%")').alias('exists?')).show()
+-------+
|exists?|
+-------+
| true|
+-------+
Here, we get True because there exists at least one True value in the boolean mask.
Mapping column values using expr method
We can map column values using CASE WHEN in the expr(~) method like so:
col = F.expr('CASE WHEN age < 40 THEN "JUNIOR" ELSE "SENIOR" END').alias('result')
df.withColumn('status', col).show()
+----+---+------+
|name|age|status|
+----+---+------+
|Alex| 30|JUNIOR|
| Bob| 50|SENIOR|
+----+---+------+
Here, note the following:
we are using the DataFrame's withColumn(~) method to obtain a new PySpark DataFrame
that includes the column returned by expr(~).
PySpark SQL Functions | first method
PySpark's SQL function first(~) method returns the first value of the specified column of a
PySpark DataFrame.
Parameters
1. col | string or Column object
The column label or Column object of interest.
2. ignorenulls | boolean | optional
Whether or not to ignore null values. By default, ignorenulls=False.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
columns = ["name", "age"]
data = [("Alex", 15), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Getting the first value of a column in PySpark DataFrame
To get the first value of the name column:
import pyspark.sql.functions as F
df.select(F.first(df.name)).show()
+-----------+
|first(name)|
+-----------+
| Alex|
+-----------+
Getting the first non-null value of a column in PySpark DataFrame
Consider the following PySpark DataFrame with null values:
columns = ["name", "age"]
data = [("Alex", None), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|null|
| Bob| 20|
|Cathy| 25|
+-----+----+
By default, ignorenulls=False, which means that the first value is returned regardless of
whether it is null or not:
df.select(F.first(df.age)).show()
+----------+
|first(age)|
+----------+
| null|
+----------+
To return the first non-null value instead:
df.select(F.first(df.age, ignorenulls=True)).show()
+----------+
|first(age)|
+----------+
| 20|
+----------+
Getting the first value of each group in PySpark
The first(~) method is also useful in aggregations. Consider the following PySpark
DataFrame:
columns = ["name", "class"]
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]
df = spark.createDataFrame(data, columns)
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| null|
| Bob| A|
|Cathy| C|
+-----+-----+
To get the first value of each aggregate:
df.groupby("name").agg(F.first("class")).show()
+-----+------------+
| name|first(class)|
+-----+------------+
| Alex| A|
| Bob| null|
|Cathy| C|
+-----+------------+
Here, we are grouping by name, and then for each of these groups, we are obtaining the first
value that occurred in the class column.
PySpark SQL Functions | greatest method
PySpark SQL Functions' greatest(~) method returns the maximum value of each row in the
specified columns. Note that you must specify two or more columns.
Parameters
1. *cols | string or Column
The columns from which to compute the maximum values.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 100, 200], ["Bob", 150, 50]], ["name", "salary",
"bonus"])
df.show()
+----+------+-----+
|name|salary|bonus|
+----+------+-----+
|Alex| 100| 200|
| Bob| 150| 50|
+----+------+-----+
Getting the largest value of each row in PySpark DataFrame
To get the largest value of each row in the columns salary and bonus:
import pyspark.sql.functions as F
df.select(F.greatest("salary", "bonus")).show()
+-----------------------+
|greatest(salary, bonus)|
+-----------------------+
| 200|
| 150|
+-----------------------+
To append this column to the existing PySpark DataFrame, use the withColumn(~) method:
import pyspark.sql.functions as F
df.withColumn("my_max", F.greatest("salary", "bonus")).show()
+----+------+-----+------+
|name|salary|bonus|my_max|
+----+------+-----+------+
|Alex| 100| 200| 200|
| Bob| 150| 50| 150|
+----+------+-----+------+
PySpark SQL Functions | instr method
PySpark SQL Functions' instr(~) method returns a new PySpark Column holding the position
of the first occurrence of the specified substring in each value of the specified column.
WARNING
The position is not index-based, and starts from 1 instead of 0.
Parameters
1. str | string or Column
The column to perform the operation on.
2. substr | string
The substring of which to check the position.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("ABA",), ("BBB",), ("CCC",), (None,)], ["x",])
df.show()
+----+
| x|
+----+
| ABA|
| BBB|
| CCC|
|null|
+----+
Getting the position of the first occurrence of a substring in PySpark Column
To get the position of the first occurrence of the substring "B" in column x, use
the instr(~) method:
import pyspark.sql.functions as F
df.select(F.instr("x", "B")).show()
+-----------+
|instr(x, B)|
+-----------+
| 2|
| 1|
| 0|
| null|
+-----------+
Here, note the following:
we see 2 returned for the column value "ABA" because the substring "B" occurs in the 2nd
position - remember, this method counts position from 1 instead of 0.
if the substring does not exist in the string, then a value of 0 is returned. This is the case
for "CCC" because this string does not include "B".
if the string is null, then the result will also be null.
PySpark SQL Functions | isnan method
PySpark SQL Functions' isnan(~) method returns True where the column value is NaN (not-a-
number).
NOTE
PySpark treats null and NaN as separate entities. Please refer to our isNull(~) method for
more details.
Parameters
1. col | string or Column object
The column label or Column object to target.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
import numpy as np
df = spark.createDataFrame([["Alex", 25.0], ["Bob", np.nan], ["Cathy", float("nan")]],
["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob| NaN|
|Cathy| NaN|
+-----+----+
To get all rows where age is NaN:
from pyspark.sql import functions as F
df.where(F.isnan("age")).show()
+-----+---+
| name|age|
+-----+---+
| Bob|NaN|
|Cathy|NaN|
+-----+---+
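The null-versus-NaN distinction that isnan(~) relies on can be seen in plain Python - a sketch of the semantics, not Spark internals:

```python
import math

# Same rows as the example DataFrame
rows = [("Alex", 25.0), ("Bob", float("nan")), ("Cathy", float("nan"))]

def is_nan(x):
    """isnan(~) is True only for NaN values; None (null) is a separate entity."""
    return x is not None and math.isnan(x)

nan_rows = [name for name, age in rows if is_nan(age)]
print(nan_rows)  # ['Bob', 'Cathy']
```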
PySpark SQL Functions | last method
PySpark SQL Functions' last(~) method returns the last value of the specified column.
Parameters
1. col | string or Column object
The column label or Column object of interest.
2. ignorenulls | boolean | optional
Whether or not to ignore null values. By default, ignorenulls=False.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 15), ("Bob", 20), ("Cathy", 25)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Getting the last value of a PySpark column
To get the last value of the name column:
df.select(F.last("name")).show()
+----------+
|last(name)|
+----------+
| Cathy|
+----------+
Note we can also pass a Column object instead:
import pyspark.sql.functions as F
# df.select(F.last(F.col("name"))).show()
df.select(F.last(df.name)).show()
+----------+
|last(name)|
+----------+
| Cathy|
+----------+
Getting the last non-null value in PySpark column
Consider the following PySpark DataFrame with null values:
df = spark.createDataFrame([("Alex", 15), ("Bob", 20), ("Cathy", None)], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 15|
| Bob| 20|
|Cathy|null|
+-----+----+
By default, ignorenulls=False, which means that the last value is returned regardless of
whether it is null or not:
df.select(F.last(df.age)).show()
+---------+
|last(age)|
+---------+
| null|
+---------+
To return the last non-null value instead, set ignorenulls=True:
df.select(F.last(df.age, ignorenulls=True)).show()
+---------+
|last(age)|
+---------+
| 20|
+---------+
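The two behaviours of last(~) can be modelled in plain Python - a sketch of the semantics on a single, already-ordered sequence (in Spark, the order can additionally be affected by shuffles):

```python
# Same values as the age column: [15, 20, None]
ages = [15, 20, None]

# ignorenulls=False (the default): take the last value as-is
last_any = ages[-1]

# ignorenulls=True: take the last non-null value
non_null = [v for v in ages if v is not None]
last_non_null = non_null[-1] if non_null else None

print(last_any, last_non_null)  # None 20
```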
Getting the last value of each group in PySpark
The last(~) method is also useful in aggregations. Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]
df = spark.createDataFrame(data, ["name", "class"])
df.show()
+-----+-----+
| name|class|
+-----+-----+
| Alex| A|
| Alex| B|
| Bob| null|
| Bob| A|
|Cathy| C|
+-----+-----+
To get the last value of each aggregate:
df.groupby("name").agg(F.last("class")).show()
+-----+-----------+
| name|last(class)|
+-----+-----------+
| Alex| B|
| Bob| A|
|Cathy| C|
+-----+-----------+
Here, we are grouping by name, and then for each of these groups, we obtain the last
value that occurred in the class column.
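The group-wise behaviour above can be sketched in plain Python, assuming the rows arrive in the order shown - an illustrative model only, since Spark's actual ordering depends on partitioning:

```python
data = [("Alex", "A"), ("Alex", "B"), ("Bob", None), ("Bob", "A"), ("Cathy", "C")]

# Keep the most recently seen class for each name
last_per_group = {}
for name, cls in data:
    last_per_group[name] = cls

print(last_per_group)  # {'Alex': 'B', 'Bob': 'A', 'Cathy': 'C'}
```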
PySpark SQL Functions | least method
PySpark SQL Functions' least(~) takes as input multiple columns, and returns a PySpark
Column holding the least value for every row of the input columns.
Parameters
1. *cols | string or Column
The input columns whose values will be compared row by row.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[20,10], [30,50], [40,90]], ["A", "B"])
df.show()
+---+---+
| A| B|
+---+---+
| 20| 10|
| 30| 50|
| 40| 90|
+---+---+
Getting the least value for every row of specified columns in PySpark
To get the least value for every row of columns A and B:
import pyspark.sql.functions as F
df.select(F.least("A","B")).show()
+-----------+
|least(A, B)|
+-----------+
| 10|
| 30|
| 40|
+-----------+
We can also pass Column objects instead of column labels:
df.select(F.least(df.A,df.B)).show()
+-----------+
|least(A, B)|
+-----------+
| 10|
| 30|
| 40|
+-----------+
Note that we can append the Column returned by least(~) by using withColumn(~):
df.withColumn("smallest", F.least("A","B")).show()
+---+---+--------+
| A| B|smallest|
+---+---+--------+
| 20| 10| 10|
| 30| 50| 30|
| 40| 90| 40|
+---+---+--------+
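The row-wise comparison performed by least(~) can be sketched in plain Python - note that the comparison runs across each row, not down a column:

```python
# Same rows as the example DataFrame with columns A and B
rows = [(20, 10), (30, 50), (40, 90)]

# least(A, B) takes the minimum within each row
smallest = [min(a, b) for a, b in rows]
print(smallest)  # [10, 30, 40]
```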
PySpark SQL Functions | length method
PySpark SQL Functions' length(~) method returns a new PySpark Column holding the lengths
of the string values in the specified column.
Parameters
1. col | string or Column
The column whose string values' length will be computed.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Computing the length of column strings in PySpark
To compute the length of each value of the name column, use the length(~) method:
import pyspark.sql.functions as F
df.select(F.length("name")).show()
+------------+
|length(name)|
+------------+
| 4|
| 3|
| 5|
+------------+
We could also pass in a Column object instead of a column label like so:
# df.select(F.length(df.name)).show()
df.select(F.length(F.col("name"))).show()
+------------+
|length(name)|
+------------+
| 4|
| 3|
| 5|
+------------+
Note that we can append a new column containing the length of the strings
using withColumn(~):
df_new = df.withColumn("name_length", F.length("name"))
df_new.show()
+-----+---+-----------+
| name|age|name_length|
+-----+---+-----------+
| Alex| 20| 4|
| Bob| 30| 3|
|Cathy| 40| 5|
+-----+---+-----------+
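For string columns, length(~) behaves like Python's built-in len applied per value - a plain-Python sketch of the semantics:

```python
# Same values as the name column
names = ["Alex", "Bob", "Cathy"]

# length(~) computes the character length of each string value
lengths = [len(n) for n in names]
print(lengths)  # [4, 3, 5]
```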
PySpark SQL Functions | lit method
PySpark SQL Functions' lit(~) method creates a Column object with the specified value.
Parameters
1. col | value
A value to fill the column.
Return Value
A Column object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Creating a column of constants in PySpark DataFrame
To create a new PySpark DataFrame with the name column of df and a new column
called is_single made up of True values:
import pyspark.sql.functions as F
df2 = df.select(F.col("name"), F.lit(True).alias("is_single"))
df2.show()
+----+---------+
|name|is_single|
+----+---------+
|Alex| true|
| Bob| true|
+----+---------+
Here, F.lit(True) returns a Column object, which has a method called alias(~) that assigns a
label.
Note that you could append a new column of constants using the withColumn(~) method:
import pyspark.sql.functions as F
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df_new = df.withColumn("is_single", F.lit(True))
df_new.show()
+----+---+---------+
|name|age|is_single|
+----+---+---------+
|Alex| 20| true|
| Bob| 30| true|
+----+---+---------+
Creating a column whose values are based on a condition in PySpark
We can also use lit(~) to create a column whose values depend on some condition:
import pyspark.sql.functions as F
col = F.when(F.col("age") <= 20, F.lit("junior")).otherwise(F.lit("senior"))
df3 = df.withColumn("status", col)
df3.show()
+----+---+------+
|name|age|status|
+----+---+------+
|Alex| 20|junior|
| Bob| 30|senior|
+----+---+------+
Note the following:
we are using the when(~) and otherwise(~) pattern to fill the values of the column
conditionally.
we are using the withColumn(~) method to append a new column named status.
the F.lit("junior") can actually be replaced by the plain string "junior" - we use lit(~) here
simply to demonstrate one usage of the method.
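The when/otherwise labelling above can be modelled in plain Python - an illustrative sketch of the conditional logic, not Spark's implementation:

```python
rows = [("Alex", 20), ("Bob", 30)]

def status(age):
    """Model of when(age <= 20, 'junior').otherwise('senior')."""
    return "junior" if age <= 20 else "senior"

labeled = [(name, age, status(age)) for name, age in rows]
print(labeled)  # [('Alex', 20, 'junior'), ('Bob', 30, 'senior')]
```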
PySpark SQL Functions | lower method
PySpark SQL Functions' lower(~) method returns a new PySpark Column with the specified
column lower-cased.
Parameters
1. col | string or Column
The column to perform the lowercase operation on.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["ALEX", 25], ["BoB", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|ALEX| 25|
| BoB| 30|
+----+---+
Lowercasing strings in PySpark DataFrame
To lower-case the strings in the name column:
import pyspark.sql.functions as F
df.select(F.lower(df.name)).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
+-----------+
Note that passing in a column label as a string also works:
import pyspark.sql.functions as F
df.select(F.lower("name")).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
+-----------+
Replacing column with lowercased column in PySpark
To replace the name column with the lower-cased version, use the withColumn(~):
import pyspark.sql.functions as F
df.withColumn("name", F.lower(df.name)).show()
+----+---+
|name|age|
+----+---+
|alex| 25|
| bob| 30|
+----+---+
PySpark SQL Functions | max method
PySpark SQL Functions' max(~) method returns the maximum value in the specified column.
Parameters
1. col | string or Column
The column in which to obtain the maximum value.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Getting the maximum value of a PySpark column
To obtain the maximum age:
import pyspark.sql.functions as F
df.select(F.max("age")).show()
+--------+
|max(age)|
+--------+
| 30|
+--------+
To obtain the maximum age as an integer:
list_rows = df.select(F.max("age")).collect()
list_rows[0][0]
30
Here, the collect() method returns a list of Row objects, which in this case is length one
because the PySpark DataFrame returned by select(~) only has one row. The content of
the Row object can be accessed via [0].
Getting the maximum value of each group in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20, "A"],
                            ["Bob", 30, "B"],
                            ["Cathy", 50, "A"]],
                           ["name", "age", "class"])
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
To get the maximum age of each class:
df.groupby("class").agg(F.max("age").alias("MAX AGE")).show()
+-----+-------+
|class|MAX AGE|
+-----+-------+
| A| 50|
| B| 30|
+-----+-------+
Here, we are using the alias(~) method to assign a label to the aggregated age column.
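The group-wise maximum above can be sketched in plain Python - an illustrative model of groupby + max, not Spark's distributed implementation:

```python
rows = [("Alex", 20, "A"), ("Bob", 30, "B"), ("Cathy", 50, "A")]

# Maximum age per class
max_age = {}
for _, age, cls in rows:
    max_age[cls] = max(age, max_age.get(cls, age))

print(max_age)  # {'A': 50, 'B': 30}
```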
Getting Started with PySpark
What is PySpark?
PySpark is an API that allows you to write Python code to interact with Apache Spark, an
open-source distributed computing framework for handling big data. As the size of data
grows year over year, Spark has become a popular framework in the industry to efficiently
process large datasets for streaming, data engineering, real-time analytics, exploratory
data analysis and machine learning.
Why use PySpark?
The core value proposition behind PySpark is that:
Spark partitions the dataset into smaller chunks and stores them in multiple machines. By
doing so, Spark can efficiently process massive volumes of data in parallel. This is extremely
useful when you are dealing with large datasets that cannot fit into the memory of a single
machine.
PySpark can handle a wide array of data formats including Hadoop Distributed File System
(HDFS), Cassandra and Amazon S3.
Anatomy of Spark
The following diagram shows the main components of a Spark application:
Here, we are applying the map(~) transformation, which applies a function over each element to
yield a new RDD, and then we perform the filter(~) transformation to obtain a subset of the
data. RDDs are immutable, meaning RDD cannot be modified once created. When you
perform a transformation on a RDD, a new RDD is returned while the original is kept intact.
NOTE
Each newly created RDD holds a reference to the original RDD prior to the transformation.
This allows Spark to keep track of the sequence of transformations, which is referred to
as RDD lineage.
Actions
An action triggers a computation, and returns a value back to the Driver program, or writes
to a stable external storage system:
This should make sense because the data held by the RDD even after applying some
transformation is still partitioned into multiple nodes, and so we would need to aggregate
the outputs into a single place - the driver node in this case.
Examples of actions include show(), reduce() and collect().
WARNING
Since all the data from each node is sent over to the driver with an action, make sure that
the driver node has enough RAM to hold all the incoming data - otherwise, an out-of-
memory error will occur.
Lazy transformations
When you execute a transformation, Spark will not immediately perform the
transformation. Instead, the RDD will wait until an action is required, and only then will the
transformation fire. We call this behaviour lazy-execution, and it has the following
benefits:
Scheduling - better utilisation of the cluster
Some transformations can be grouped together to avoid network traffic
Spark jobs, stages and tasks
When you invoke an action (e.g. count(), take(), collect()) on an RDD, a job is created. Spark
will then internally decompose a job into a single or multiple stages. Next, Spark splits each
stage into tasks, which are units of work that the Spark driver’s scheduler ships
to executors on the worker nodes to handle. Each task processes one unit of partitioned
dataset in its memory.
Executors with one core
As an example, consider the following setup:
Here, our RDD is composed of 6 partitions, with 2 partitions on each worker node.
The executor threads are equipped with one CPU core, which means that only one task can
be performed by each executor at any given time. The total number of tasks is equal to the
number of partitions, which means that there are 6 tasks.
Executors with multiple cores
Multiple tasks can run in parallel on the same executor if you allocate more than one core to
each executor. Consider the following case:
Here, each executor is equipped with 2 cores. The total number of tasks here is 6, which is
the same as the previous case since there are still 6 partitions. With 2 cores,
each executor can handle 2 tasks in parallel. As you can tell from this example, the more
cores you allocate to each executor, the more tasks you can perform in parallel.
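The relationship between partitions, cores and parallelism can be sketched with a small helper - an illustrative model only, since real scheduling also depends on data locality and skew:

```python
import math

def task_waves(num_partitions, num_executors, cores_per_executor):
    """Number of scheduling 'waves' needed when each core runs one task at a time."""
    parallel_slots = num_executors * cores_per_executor
    return math.ceil(num_partitions / parallel_slots)

# 6 partitions, 3 executors with 1 core each -> tasks run in 2 waves
print(task_waves(6, 3, 1))  # 2
# 6 partitions, 3 executors with 2 cores each -> all tasks run in 1 wave
print(task_waves(6, 3, 2))  # 1
```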
Number of partitions
In Spark, we can choose the number of partitions by which to divide our dataset. For
instance, should we divide up our data into just a few partitions, or into hundreds of
partitions? We should choose carefully because the number of partitions has an immense
impact on the cluster's performance. As examples, let's explore the case of over-partitioning
and under-partitioning.
Under-partitioning
Consider the following case:
Here, each of our executors is equipped with 10 cores, but only 2 partitions reside at each
node. This means that each executor can tackle the two tasks assigned to it in parallel using
just 2 cores - the other 8 cores remain unused here. In other words, we are not making use
of the available cores here since the number of partitions is too small, that is, we are
underutilising our resources. A better configuration would be to have 10 partitions on each
worker node so that each executor can parse all 10 partitions on their node in parallel.
Over-partitioning
Consider the following case:
Here, we have 6 partitions residing in each worker node, which is equipped with only one
CPU core. The driver would need to create and schedule the same number of tasks as there
are partitions (18 in this case). There is considerable overhead in having to manage and
coordinate many small tasks. Therefore, having a large number of partitions is also not
desirable.
Recommended number of partitions
The official PySpark documentation recommends that there should be 2 to 4
partitions for each core in the executor. An example of this is as follows:
Here, we have 2 partitions per worker node, which holds an executor with one CPU core.
Note that the recommendation offered by the official documentation is only a rule of thumb -
you might want to experiment with different numbers of partitions. For instance, you might
find that assigning two cores to each executor here would boost performance since the 2
partitions can then be handled in parallel by each executor.
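The rule of thumb above can be expressed as a small helper - the factor of 3 below is an assumed middle value of the documented 2-to-4 range, not a hard rule:

```python
def recommended_partitions(total_cores, factor=3):
    """Rule of thumb: 2-4 partitions per core; factor=3 is an assumed midpoint."""
    return total_cores * factor

# e.g. 3 worker nodes with 1 core each, using the low end of the range
print(recommended_partitions(3, factor=2))  # 6
```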
Next steps
This introductory guide only covered the basics of PySpark. For your next step, we
recommend that you follow our Getting Started with PySpark on Databricks guide to get
some hands-on experience with PySpark programming on Databricks for free. After that,
you can read our Comprehensive guide to RDDs to learn much more about RDDs!
File system in Databricks
Folder               Description
databricks-datasets  Open-source datasets for exploration.
databricks-results   Stores results downloaded via the display(~) method.
Behind the scenes, Databricks will upload this file into the driver node's file system instead
of DBFS.
After uploading our hello_world.txt file, we should have the following files in our account
workspace:
Here, we will use our Demo notebook to write some Python code that reads
the hello_world.txt file.
Accessing workspace files programmatically
In our Demo notebook, we can check for the existence of our new file by:
%sh ls /Workspace/Users/support@skytowner.com
Demo
hello_world.txt
Here, we are using %sh instead of %fs because the workspace files are located in the driver
node's file system.
We can also programmatically access the text file like so:
with open("/Workspace/Users/support@skytowner.com/hello_world.txt", "r") as file:
    content = file.read()
    print(content)
Hello world
Mounting object storage to DBFS
Mounting Azure blob storage
Let's mount our Azure blob storage to DBFS such that we can access our storage directly
through DBFS. To do so, use the dbutils.fs.mount() method:
storage_account = "demostorageskytowner"
container = "democontainer"
access_key = "*****"
dbutils.fs.mount(
    source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point = "/mnt/my_demo_storage",
    extra_configs = {
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    }
)
True
Here, the access_key can be obtained in the Azure portal:
We can now access all the files in our blob storage using DBFS like so:
%fs ls /mnt/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/ my_demo_storage/ 0 0
Let's have a look at what's inside our blob storage:
%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000
Reading files from mounted storage
Recall that the DBFS is mounted on the driver node as /dbfs, which means we can directly
access the files in our blob storage like so:
with open("/dbfs/mnt/my_demo_storage/hello_world.txt", "r") as file:
    content = file.read()
    print(content)
Hello world
Writing files to mounted storage
Let's now write a text file to the mounted storage:
with open("/dbfs/mnt/my_demo_storage/bye_world.txt", "w") as file:
    file.write("Bye world!")
Let's check that the new file has been written into the mounted storage:
%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/bye_world.txt bye_world.txt 10 1689253301000
2 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000
We can also check the Azure blob UI dashboard to see that this text file is present:
Unmounting storage
To unmount the storage:
dbutils.fs.unmount("/mnt/my_demo_storage")
/mnt/my_demo_storage has been unmounted.
Comprehensive guide on caching in PySpark
Prerequisites
To follow along with this guide, you should know what RDDs, transformations and actions
are. Please visit our comprehensive guide on RDD if you feel rusty!
What is caching in Spark?
The core data structure used in Spark is the resilient distributed dataset (RDD). There are
two types of operations one can perform on a RDD: a transformation and an action. Most
operations such as mapping and filtering are transformations. Whenever a transformation is
applied to a RDD, a new RDD is made instead of mutating the original RDD directly:
Here, applying the map transformation on the original RDD creates RDD', and then applying
the filter transformation creates RDD''.
Now here is where caching comes into play. Suppose we wanted to apply a transformation
on RDD'' multiple times. Without caching, RDD'' must be computed from scratch
using RDD each time. This means that if we apply a transformation on RDD'' 10 times,
then RDD'' must be generated 10 times from RDD. If we cache RDD'', then we no longer
have to recompute RDD'', but instead reuse the RDD'' that exists in cache. In this way,
caching can greatly speed up your computations and is therefore critical for optimizing your
PySpark code.
How to perform caching in PySpark?
Caching a RDD or a DataFrame can be done by calling the RDD's or
DataFrame's cache() method. The catch is that the cache() method is a transformation (lazy-
execution) instead of an action. This means that even if you call cache() on a RDD or a
DataFrame, Spark will not immediately cache the data. Spark will only cache the RDD once an
action such as count() is performed:
# Cache will be created because count() is an action
df.cache().count()
Here, df.cache() returns the cached PySpark DataFrame.
We could also perform caching via the persist() method. The difference
between cache() and persist() is that cache() always stores the cache using the
setting MEMORY_AND_DISK, whereas persist() allows you to specify storage levels other
than MEMORY_AND_DISK. MEMORY_AND_DISK means that the cache will be stored in
memory if possible, and otherwise on disk. Other storage levels
include MEMORY_ONLY and DISK_ONLY.
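The payoff of caching can be illustrated with plain-Python memoization - an analogy only, not how Spark's cache manager actually works:

```python
compute_count = 0

def expensive_transform(data):
    """Stand-in for recomputing an RDD from scratch."""
    global compute_count
    compute_count += 1  # track how many times we recompute
    return [x * 2 for x in data]

cache = {}

def cached_transform(key, data):
    if key not in cache:  # compute only on the first request
        cache[key] = expensive_transform(data)
    return cache[key]

data = [1, 2, 3]
for _ in range(10):
    cached_transform("rdd", data)

print(compute_count)  # 1 -- computed once, reused 9 times
```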
Basic example of caching in PySpark
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Let's print out the execution plan of the filter(~) operation that fetches rows where
the age is not 20:
df.filter('age != 20').explain()
== Physical Plan ==
*(1) Filter (isnotnull(age#6003L) AND NOT (age#6003L = 20))
+- *(1) Scan ExistingRDD[name#6002,age#6003L]
Here note the following:
the explain() method prints out the physical plan, which you can interpret as the actual
execution plan
the executed plan is often read from bottom to top
we see that PySpark first scans the DataFrame (Scan ExistingRDD). RDD is shown here
instead of DataFrame because, remember, DataFrames are implemented as RDDs under the
hood.
while scanning, the filtering (isnotnull(age) AND NOT (age=20)) is applied.
Let us now cache the PySpark DataFrame returned by the filter(~) method using cache():
# Call count(), which is an action, to trigger caching
df.filter('age != 20').cache().count()
2
Here, the count() method is an action, which means that the PySpark DataFrame returned
by filter(~) will be cached.
Let's call filter(~) again and print the physical plan:
df.filter('age != 20').explain()
== Physical Plan ==
InMemoryTableScan [name#6002, age#6003L]
+- InMemoryRelation [name#6002, age#6003L], StorageLevel(disk, memory, deserialized,
1 replicas)
+- *(1) Filter (isnotnull(age#6003L) AND NOT (age#6003L = 20))
+- *(1) Scan ExistingRDD[name#6002,age#6003L]
The physical plan is now different from when we called filter(~) before caching. We see two
new operations: InMemoryTableScan and InMemoryRelation. Behind the scenes, the cache
manager checks whether a DataFrame resulting from the same computation exists in cache.
In this case, we have cached the resulting DataFrame from filter('age!=20') previously
via cache() followed by an action (count()), so the cache manager uses this cached
DataFrame instead of recomputing filter('age!=20').
The InMemoryTableScan and InMemoryRelation we see in the physical plan indicate that we
are working with the cached version of the DataFrame.
Using the cached object explicitly
The methods cache() and persist() return a cached version of the RDD or DataFrame. As we
have seen in the above example, we can cache RDDs or DataFrames without explicitly using
the returned cached object:
df.filter('age != 20').cache().count()
# Cached DataFrame will be used
df.filter('age != 20').show()
It is better practice to use the cached object returned by cache() like so:
df_cached = df.filter('age != 20').cache()
print(df_cached.count())
df_cached.show()
The advantage of this is that calling methods like df_cached.count() clearly indicates that we
are using a cached DataFrame.
Confirming cache via Spark UI
We can also confirm the caching behaviour via the Spark UI by clicking on the Stages tab:
Click on the link provided in the Description column. This should open up a graph that shows
the operations performed under the hood:
You should see a green box in the middle, which means that this specific operation was not
computed thanks to the presence of a cache.
Note that if you are using Databricks, then click on View in the output cell:
This should open up the Spark UI and show you the same graph as above.
We could also see the stored caches on the Storage tab:
We can see that all the partitions of the RDD (8 in this case) resulting from the
operation filter (age != 20) are stored in memory cache as opposed to disk cache. This is
because the storage level of the cache() method is set to MEMORY_AND_DISK by default,
which means the cache is stored on disk only if it does not fit in memory.
Clearing existing cache
To clear (evict) all the cache, call the following:
spark.catalog.clearCache()
To clear the cache of a specific RDD or DataFrame, call the unpersist() method:
df_cached = df.filter('age != 20').cache()
# Trigger an action to persist cache
df_cached.count()
# Delete the cache
df_cached.unpersist()
NOTE
It is good practice to clear the cache because if space starts running out, Spark will begin
removing cache entries using the LRU (least recently used) policy. It is generally better not to rely
on automatic deletion because it may remove a cache that is vital for your PySpark application.
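The LRU policy mentioned above can be sketched in plain Python - a toy model of least-recently-used eviction, not Spark's actual cache-management code:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)  # mark as recently used
            return self.store[key]
        return None

cache = LRUCache(2)
cache.put("df1", "partitions-1")
cache.put("df2", "partitions-2")
cache.get("df1")                   # touch df1, so df2 is now least recently used
cache.put("df3", "partitions-3")   # capacity exceeded: df2 is evicted
print(list(cache.store))           # ['df1', 'df3']
```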
Things to consider when caching
Cache computed data that is used frequently
Caching is recommended when you use the same computed RDD or DataFrame multiple
times. Do remember that computing RDDs is generally very fast, so you may consider
caching only when your PySpark program is too slow for your needs.
Cache minimally
We should cache frugally because caching consumes memory, and memory is needed for
the worker nodes to perform their tasks. If we do decide to cache, make sure that we only
cache the part of the data that we will reuse multiple times. For instance, if we are going
to frequently perform some computation on column A only, then it makes sense to cache
column A instead of the entire DataFrame. As another example, if one query involves
columns A and B, and another involves columns B and C, then it may be better to cache
columns A, B and C together instead of caching columns (A and B) and columns (B and C),
which would store column B in cache redundantly.
Comprehensive Guide to RDD in PySpark
schedule AUG 12, 2023
local_offer
PySpark
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!
Prerequisites
You should already be familiar with the basics of PySpark. For a refresher, check out
our Getting Started with PySpark guide.
What is a RDD?
PySpark operates on big data by partitioning the data into smaller subsets spread across
multiple machines. This allows for parallelisation, and this is precisely why PySpark can
handle computations on big data efficiently. Under the hood, PySpark uses a unique data
structure called RDD, which stands for resilient distributed dataset. In essence, RDD is an
immutable data structure in which the data is partitioned across a number of worker nodes
to facilitate parallel operations:
In the diagram above, a single RDD has 4 partitions that are distributed across 3 worker nodes,
with the second worker node holding 2 partitions. By definition, a single partition cannot
span across multiple worker nodes. This means, for instance, that partition 2 can never
partially reside in both worker node 1 and 2 - the partition can only reside in either of
worker node 1 or 2. The Driver node serves to coordinate the task execution between these
worker nodes.
Transformations and actions
There are two operations we can perform on a RDD:
Transformations
Actions
Transformations
Transformations are basically functions applied on RDDs, which result in the creation of new
RDDs. RDDs are immutable, which means that even after applying a transformation, the
original RDD is kept intact. Examples of transformations include map(~) and filter(~).
For instance, consider the following RDD transformation:
Here, our RDD has 4 partitions that are distributed across 3 worker nodes. Partition 1 holds
the string a, partition 2 holds the values [d,B] and so on. Suppose we now apply a map
transformation that converts the string into uppercase. After running the map
transformation, we end up with RDD' shown on the right. What's important here is that
each worker node performs the map transformation on the data it possesses - this is what
makes distributed computing so efficient!
Since transformations return a new RDD, we can keep on applying transformations. The
following example shows the creation of two new RDDs after applying two separate
transformations:
Here, we apply the map(~) transformation to a RDD, which applies a function to each element
in RDD to yield RDD'. Next, we apply the filter(~) transformation to select a subset of the
data in RDD' to finally obtain RDD''.
Spark keeps track of the series of transformations applied to RDD using graphs called RDD
lineage or RDD dependency graphs. In the above diagram, RDD is considered to be a parent
of RDD'. Every child RDD has a reference to its parent (e.g. RDD' will always have a reference
to RDD).
Actions
Actions are operations that either:
send all the data held by multiple nodes to the driver node, for instance to print some
result on the driver node (e.g. show(~)).
or save some data on an external storage system such as HDFS or Amazon S3
(e.g. saveAsTextFile(~)).
Typically, actions are followed by a series of transformations like so:
After applying transformations, the actual data of the output RDD still reside in different
nodes. Actions are used to gather these scattered results in a single place - either the driver
node or an external data storage.
NOTE
Transformations are lazy, which means that even if you call the map(~) function, Spark will
not actually do anything behind the scenes. All transformations are only executed once an
action, such as collect(~), is triggered. This allows Spark to optimise the transformations by:
allocating resources more efficiently
grouping transformations together to avoid network traffic
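Lazy evaluation is easiest to feel in plain Python, where map and filter objects are also lazy iterators. This is only an analogy for Spark's behaviour, not Spark code:

```python
# Plain-Python analogy (not Spark): map/filter objects are lazy iterators,
# so no work happens until we "collect" the results with list().
log = []

def to_upper(s):
    log.append(s)      # record when the work actually happens
    return s.upper()

lazy = filter(lambda s: s == "ALEX", map(to_upper, ["Alex", "Bob", "Cathy"]))
assert log == []       # nothing computed yet -- the chain is lazy

result = list(lazy)    # the "action" triggers the whole chain at once
assert result == ["ALEX"]
assert log == ["Alex", "Bob", "Cathy"]
```

Just as here, calling map(~) or filter(~) on an RDD only records the transformation; the work happens when an action such as collect(~) runs.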
Example using PySpark
Consider the same set of transformations and action from earlier:
Here, we are first converting each string into uppercase using the transformation map(~),
and then performing a filter(~) transformation to obtain a subset of the data. Finally, we
send the individual results held in different partitions to the driver node to print the final
result on the screen using an action.
Consider the following RDD with 3 partitions:
rdd = sc.parallelize(["Alex","Bob","Cathy"], numSlices=3)
rdd.collect()
['Alex', 'Bob', 'Cathy']
Here:
sc, which stands for SparkContext, is a global variable pre-defined in Databricks notebooks.
we are using the parallelize(~) method of SparkContext to create a RDD.
the number of partitions is specified using the numSlices argument.
the collect(~) method is used to gather all the data from each partition to the driver node
and print the results on the screen.
Next, we use the map(~) transformation to convert each string (which resides in different
partitions) to uppercase. We then use the filter(~) transformation to obtain strings that
equal "ALEX":
rdd2 = rdd.map(lambda x: x.upper())
rdd3 = rdd2.filter(lambda name: name == "ALEX")
rdd3.collect()
['ALEX']
NOTE
To run this example, visit our guide Getting Started with PySpark on Databricks.
Narrow and wide transformations
There are two types of transformations:
Narrow - no shuffling is needed, which means that data residing in different nodes do not
have to be transferred to other nodes
Wide - shuffling is required, and so wide transformations are costly
The difference is illustrated below:
For narrow transformations, the partition remains in the same node after the
transformation, that is, the computation is local. In contrast, wide transformations involve
shuffling, which is slow and expensive because of network latency and bandwidth.
Some examples of narrow transformations include map(~) and filter(~). Consider a simple
map operation where we increment an integer value by one. It's clear that each
worker node can perform this on its own since there is no dependency on the
partitions living on other worker nodes.
Some examples of wide transformations include groupBy(~) and sort(~). Suppose we
wanted to perform a groupBy(~) operation on some column, say a categorical variable
consisting of 3 classes: A, B and C. The following diagram illustrates how Spark will execute
this operation:
Notice how groupBy(~) cannot be computed locally because the operation requires
dependency between partitions lying in different nodes.
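The narrow/wide distinction can be sketched with a toy model in plain Python, where each inner list stands for a partition living on a different worker node (this is an illustration only, not the Spark API):

```python
from collections import defaultdict

# Toy model (not Spark): each inner list is a "partition" on a different node.
partitions = [["A", "B"], ["C", "A"], ["B", "A"]]

# Narrow transformation (map): each partition is processed locally,
# with no data movement between nodes.
mapped = [[c.lower() for c in part] for part in partitions]

# Wide transformation (groupBy): rows sharing a key must end up together,
# so every partition ships its rows across the network -- the shuffle.
groups = defaultdict(list)
for part in partitions:
    for c in part:
        groups[c].append(c)
```

The map step never looks outside its own partition, whereas the groupBy step must pull matching values from every partition - that cross-partition movement is what makes wide transformations expensive.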
Fault tolerance property
The R in RDD stands for resilient, meaning that even if a worker node fails, the missing
partition can still be recomputed to recover the RDD with the help of RDD lineage. For
instance, consider the following example:
Suppose RDD'' is "damaged" because of a node failure. Since Spark knows that RDD' is the
parent of RDD'', Spark will be able to re-compute RDD'' from RDD'.
Viewing the underlying partitions of a RDD in PySpark
Let's create a RDD in PySpark by using the parallelize(~) method once again:
rdd = sc.parallelize(["a","B","c","D"])
rdd.collect()
['a', 'B', 'c', 'D']
To see the underlying partition of the RDD, use the glom() method like so:
rdd.glom().collect()
[[], ['a'], [], ['B'], [], ['c'], [], ['D']]
Here, we see that the RDD has 8 partitions by default. This default number of partitions is
governed by the spark.default.parallelism configuration property, which typically defaults to
the number of cores available. Because our RDD only contains 4 values, half of the
partitions are empty.
We can specify that we want to break down our data into say 3 partitions by supplying
the numSlices parameter:
rdd = sc.parallelize(["a","B","c","D"], numSlices=3)
rdd.glom().collect()
[['a'], ['B'], ['c', 'D']]
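The way parallelize(~) spreads elements over partitions can be approximated with a short plain-Python sketch. This mimics Spark's internal slicing rule for in-memory collections (an approximation for intuition, not the Spark API):

```python
def split_into_slices(data, num_slices):
    # Partition i receives the slice [i*n//num_slices, (i+1)*n//num_slices).
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Reproduces the partition layouts shown above:
assert split_into_slices(["a", "B", "c", "D"], 8) == \
    [[], ["a"], [], ["B"], [], ["c"], [], ["D"]]
assert split_into_slices(["a", "B", "c", "D"], 3) == [["a"], ["B"], ["c", "D"]]
```

Notice how 4 elements over 8 slices naturally leaves every other slice empty, matching the glom() output above.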
Difference between RDD and DataFrames
When working with PySpark, we usually use DataFrames instead of RDDs. Similar
to RDDs, DataFrames are also an immutable collection of data, but the key difference is
that DataFrames can be thought of as a spreadsheet-like table where the data is organised
into columns. This does limit the use-case of DataFrames to only structured or tabular data,
but the added benefit is that we can work with our data at a much higher level of
abstraction. If you've ever used a Pandas DataFrame, you'll understand just how easy it is to
interact with your data.
DataFrames are actually built on top of RDDs, but there are still cases when you would
rather work at a lower level and tinker directly with RDDs. For instance, if you are dealing
with unstructured data (e.g. audio and streams of data), you would use RDDs rather
than DataFrames.
NOTE
If you are dealing with structured data, we highly recommend that you
use DataFrames instead of RDDs. This is because Spark will optimize the series of operations
you perform on DataFrames under the hood, but will not do so in the case of RDDs.
Seeing the partitions of a DataFrame
Since DataFrames are built on top of RDDs, we can easily see the underlying RDD
representation of a DataFrame. Let's start by creating a simple DataFrame:
columns = ["Name", "Age"]
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 30]], columns)
df.show()
+-----+---+
| Name|Age|
+-----+---+
| Alex| 15|
|  Bob| 20|
|Cathy| 30|
+-----+---+
To see how this DataFrame is partitioned by its underlying RDD:
df.rdd.glom().collect()
[[],
[],
[Row(Name='Alex', Age=15)],
[],
[],
[Row(Name='Bob', Age=20)],
[],
[Row(Name='Cathy', Age=30)]]
We see that our DataFrame is partitioned in terms of Row, which is a native object in
PySpark.
Getting Started with PySpark on Databricks
Setting up PySpark on Databricks
Databricks, founded by the original creators of Spark, describes itself as an "open and unified
data analytics platform for data engineering, data science, machine learning and analytics."
The company adds a layer on top of cloud providers (Microsoft Azure, AWS, Google Cloud)
and manages the Spark cluster on your behalf.
Databricks offers a free tier (community edition) to spin up a node and run some PySpark,
so this is the best way to gain hands-on experience with PySpark without having to
install a Linux OS, the environment in which Spark typically runs.
Registering to Databricks
Firstly, head over to the Databricks webpageopen_in_new, and fill out the sign up form to
register for the community edition. After receiving a confirmation email from Databricks,
click on the "Get started with Community Edition" link at the bottom instead of choosing a
cloud provider:
WARNING
If you click on a cloud provider, Databricks will create a free-trial account instead of
a community-edition account. A free-trial account is very different from a community-
edition one as you will have to:
set up your own cloud storage on your provider (e.g. Google Cloud Storage)
pay for the resources you consume on your provider
For this reason, we highly recommend making a community-edition account for
learning PySpark.
Environment of community edition
The community edition provides you with:
a single cluster with 15GB of storage
a single driver node equipped with 2 CPUs without any worker nodes
notebooks to write some PySpark code
Creating a cluster
We first need to create a cluster to run PySpark. Head over to the Databricks dashboard,
and click on "Compute" on the left side bar:
Now, click on the "Create Cluster" button, and enter the desired name for your cluster:
Click on the "Create Cluster" button on the top, and this will spin up a free 15GB cluster
consisting of a single driver node without any worker nodes.
WARNING
The cluster in the community edition will automatically terminate after an idle period of two
hours. Terminated clusters cannot be restarted, and so you would have to spin up a new
cluster. In order to set up a new cluster with the same configuration as the terminated one,
click on the terminated cluster and click the "clone" button on top.
We now need to wait 5 or so minutes until the cluster is set up. When the pending
symbol turns into a green circle, the cluster is set up and ready to go!
Creating a notebook
Databricks uses notebooks (similar to JupyterLab) to run PySpark code. To create a new
notebook, click on the following in the left side bar:
Type in the desired name of the notebook, and select the cluster that we created in the
previous step:
The code that we write in this notebook will be in Python, and will be run on the cluster
we created earlier.
Running our first PySpark code
Now that we have our cluster and notebook set up, we can finally run some PySpark code.
To create a PySpark DataFrame:
columns = ["name", "age"]
data = [("Alex", 15), ("Bob", 20), ("Cathy", 25)]
df = spark.createDataFrame(data, columns)
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
Applying a custom function on PySpark Columns with user-defined functions
from pyspark.sql.functions import udf
# Register our custom function my_func as a UDF
my_udf = udf(my_func)
# Pass in two columns to our my_udf
df_result = df.select(my_udf('name', 'age'))
df_result.show()
+--------------------+
| my_func(name, age)|
+--------------------+
|Alex is 10 years old|
| Bob is 20 years old|
|Cathy is 30 years...|
+--------------------+
Here, note the following:
our custom function my_func(~) now takes in two column values
when calling my_udf(~), we now pass in two columns
Specifying the resulting column type
By default, the column returned will always be of type string regardless of the actual return
type of your custom function. For instance, consider the following custom function:
def my_double(int_age):
    return 2 * int_age
# Register the function
udf_double = udf(my_double)
df_result = df.select(udf_double('age'))
df_result.show()
+--------------+
|my_double(age)|
+--------------+
| 20|
| 40|
| 60|
+--------------+
Here, the return type of our function my_double(~) is obviously an integer, but the resulting
column type is actually set to a string:
df_result.printSchema()
root
|-- my_double(age): string (nullable = true)
We can specify the resulting column type using the second argument in udf(~):
udf_double = udf(my_double, 'int')
df_result = df.select(udf_double('age'))
df_result.printSchema()
root
|-- my_double(age): integer (nullable = true)
Here, we have indicated that the resulting column type should be integer.
Equivalently, we could also import an explicit PySpark type like so:
from pyspark.sql.types import IntegerType
udf_double = udf(my_double, IntegerType())
df_result = df.select(udf_double('age'))
df_result.printSchema()
root
|-- my_double(age): integer (nullable = true)
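Declaring the correct return type matters beyond cosmetics. As a quick plain-Python illustration (not PySpark code) of why a string-typed column is a poor substitute for an integer-typed one, strings compare lexicographically:

```python
# The numbers display identically in show(), but strings sort and compare
# character by character rather than numerically:
assert sorted([20, 40, 60, 100]) == [20, 40, 60, 100]
assert sorted(["20", "40", "60", "100"]) == ["100", "20", "40", "60"]
```

The same surprise would appear in PySpark if you sorted or compared a UDF column left with the default string type.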
Calling user-defined functions in SQL expressions
To use user-defined functions in SQL expressions, register the custom function
using spark.udf.register(~):
def to_upper(some_string):
    return some_string.upper()
spark.udf.register('udf_upper', to_upper)
df.selectExpr('udf_upper(name)').show()
+---------------+
|udf_upper(name)|
+---------------+
| ALEX|
| BOB|
| CATHY|
+---------------+
Here, the selectExpr(~) method takes in as argument a SQL expression.
We could also register the DataFrame as a SQL table so that we can run full SQL expressions
like so:
# Register PySpark DataFrame as a SQL table
df.createOrReplaceTempView('my_table')
spark.sql('SELECT udf_upper(name) FROM my_table').show()
+---------------+
|udf_upper(name)|
+---------------+
| ALEX|
| BOB|
| CATHY|
+---------------+
Specifying the return type
Again, the type of the resulting column is string regardless of what your custom function
returns. Just like we did earlier when registering with udf(~), we can specify the type of the
returned column like so:
def my_double(int_age):
    return 2 * int_age
spark.udf.register('udf_double', my_double, 'int')
window = Window.partitionBy("group")
df.withColumn("MAX", F.max(F.col("age")).over(window)).show()
+-----+-----+---+---+
| name|group|age|MAX|
+-----+-----+---+---+
| Alex| A| 20| 30|
| Bob| A| 30| 30|
|Cathy| B| 40| 40|
| Dave| B| 40| 40|
+-----+-----+---+---+
Here, note the following:
the original rows are kept intact.
we computed some statistic (max(~)) about how each row relates to the other rows within
its group.
we can also use other aggregate functions such as min(~), avg(~), sum(~).
NOTE
We could also partitionBy(~) on multiple columns by passing in a list of column labels.
Assigning row numbers within groups
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 30], ["Bob", "A", 20], ["Cathy", "B", 40], ["Dave", "B", 40]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 30|
| Bob| A| 20|
|Cathy| B| 40|
| Dave| B| 40|
+-----+-----+---+
We can sort the rows of each group by using the orderBy(~) function:
window = Window.partitionBy("group").orderBy("age") # ascending order by default
To create a new column called ROW NUMBER that holds the row number of every row
within each group:
df.withColumn("ROW NUMBER", F.row_number().over(window)).show()
+-----+-----+---+----------+
| name|group|age|ROW NUMBER|
+-----+-----+---+----------+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----------+
Here, Bob is assigned a ROW NUMBER of 1 because we order the grouped rows by
the age column first before assigning the row number.
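What row_number() computes can be sketched in plain Python: sort by the partition key then the order key, and restart the numbering for each group. This is an intuition-building toy, not the Spark API:

```python
from itertools import groupby

# Sketch of row_number() over Window.partitionBy("group").orderBy("age")
rows = [("Alex", "A", 30), ("Bob", "A", 20), ("Cathy", "B", 40), ("Dave", "B", 40)]
rows.sort(key=lambda r: (r[1], r[2]))           # partition key, then order key

numbered = []
for _, part in groupby(rows, key=lambda r: r[1]):
    for i, row in enumerate(part, start=1):     # numbering restarts per group
        numbered.append(row + (i,))
```

This reproduces the table above: Bob gets 1 in group A because he is youngest there, and the counter resets to 1 for group B.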
Ordering by multiple columns
To order by multiple columns, say by "age" first and "name" second:
window = Window.partitionBy("group").orderBy("age", "name")
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----+
Ordering by descending
By default, the ordering is applied in ascending order. We can perform ordering in
descending order like so:
window = Window.partitionBy("group").orderBy(F.desc("age"), F.asc("name"))
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Alex| A| 30| 1|
| Bob| A| 20| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 2|
+-----+-----+---+----+
Here, we are ordering by age in descending order and then ordering by name in ascending
order.
Assigning ranks within groups
Consider the same PySpark DataFrame as before:
df = spark.createDataFrame([["Alex", "A", 30], ["Bob", "A", 20], ["Cathy", "B", 40], ["Dave", "B", 40]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 30|
| Bob| A| 20|
|Cathy| B| 40|
| Dave| B| 40|
+-----+-----+---+
Instead of row numbers, let's compute the ranking within each group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("RANK", F.rank().over(window)).show()
+-----+-----+---+----+
| name|group|age|RANK|
+-----+-----+---+----+
| Bob| A| 20| 1|
| Alex| A| 30| 2|
|Cathy| B| 40| 1|
| Dave| B| 40| 1|
+-----+-----+---+----+
Here, Cathy and Dave both receive a rank of 1 because they have the same age.
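The tie-handling of rank(~) can be sketched on one already-ordered group in plain Python (a toy sketch, not the Spark API): ties share a rank, and the rank after a tie is skipped.

```python
def rank_within_group(ordered_values):
    # rank(~) semantics on one ordered group: ties reuse the previous rank,
    # and because ranks are position-based, the next distinct value "skips".
    ranks = []
    for i, v in enumerate(ordered_values):
        if i > 0 and v == ordered_values[i - 1]:
            ranks.append(ranks[-1])      # tie: same rank as the previous row
        else:
            ranks.append(i + 1)          # rank = 1-based position
    return ranks

assert rank_within_group([20, 30]) == [1, 2]          # group A
assert rank_within_group([40, 40]) == [1, 1]          # group B: Cathy and Dave tie
assert rank_within_group([40, 40, 50]) == [1, 1, 3]   # rank 2 is skipped
```

The last assertion shows the difference from dense_rank(~), which would assign 2 instead of 3 after the tie.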
Computing lag, lead and cumulative distributions
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "A", 30], ["Cathy", "B", 40], ["Dave", "B", 50], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| A| 30|
|Cathy| B| 40|
| Dave| B| 50|
| Eric| B| 60|
+-----+-----+---+
Lag function
Let's create a new column where the values of name are shifted down by one for
every group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LAG", F.lag(F.col("name")).over(window)).show()
+-----+-----+---+-----+
| name|group|age| LAG|
+-----+-----+---+-----+
| Alex| A| 20| null|
| Bob| A| 30| Alex|
|Cathy| B| 40| null|
| Dave| B| 50|Cathy|
| Eric| B| 60| Dave|
+-----+-----+---+-----+
Here, Bob has a LAG value of Alex because Alex belongs to the same group and is above Bob
when ordered by age.
We can also shift down column values by 2 like so:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LAG", F.lag(F.col("name"), 2).over(window)).show()
+-----+-----+---+-----+
| name|group|age| LAG|
+-----+-----+---+-----+
| Alex| A| 20| null|
| Bob| A| 30| null|
|Cathy| B| 40| null|
| Dave| B| 50| null|
| Eric| B| 60|Cathy|
+-----+-----+---+-----+
Here, Eric has a LAG value of Cathy because Cathy has been shifted down by 2.
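On a single ordered group, lag(~) is just a downward shift padded with None at the top. A plain-Python sketch (not the Spark API):

```python
def lag_column(ordered_values, shift=1):
    # Shift values down by `shift` positions, padding the top with None.
    return ([None] * shift + ordered_values)[:len(ordered_values)]

# Group B ordered by age: Cathy, Dave, Eric
assert lag_column(["Cathy", "Dave", "Eric"]) == [None, "Cathy", "Dave"]
assert lag_column(["Cathy", "Dave", "Eric"], shift=2) == [None, None, "Cathy"]
```

Spark applies this shift independently within each window partition, which is why the nulls reappear at the top of every group.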
Lead function
The lead(~) function is the opposite of the lag(~) function - instead of shifting down values,
we shift up instead. Here's our DataFrame once again for your reference:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "A", 30], ["Cathy", "B", 40], ["Dave", "B", 50], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| A| 30|
|Cathy| B| 40|
| Dave| B| 50|
| Eric| B| 60|
+-----+-----+---+
Let's create a new column called LEAD where the name value is shifted up by one for
every group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LEAD", F.lead(F.col("name")).over(window)).show()
+-----+-----+---+----+
| name|group|age|LEAD|
+-----+-----+---+----+
| Alex| A| 20| Bob|
| Bob| A| 30|null|
|Cathy| B| 40|Dave|
| Dave| B| 50|Eric|
| Eric| B| 60|null|
+-----+-----+---+----+
Just as we could do for the lag(~) function, we can add a shift unit like so:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("LEAD", F.lead(F.col("name"), 2).over(window)).show()
+-----+-----+---+----+
| name|group|age|LEAD|
+-----+-----+---+----+
| Alex| A| 20|null|
| Bob| A| 30|null|
|Cathy| B| 40|Eric|
| Dave| B| 50|null|
| Eric| B| 60|null|
+-----+-----+---+----+
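Mirroring the lag sketch, lead(~) on one ordered group is an upward shift padded with None at the bottom (plain Python, not the Spark API):

```python
def lead_column(ordered_values, shift=1):
    # Shift values up by `shift` positions, padding the bottom with None.
    return ordered_values[shift:] + [None] * shift

# Group B ordered by age: Cathy, Dave, Eric
assert lead_column(["Cathy", "Dave", "Eric"]) == ["Dave", "Eric", None]
assert lead_column(["Cathy", "Dave", "Eric"], shift=2) == ["Eric", None, None]
```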
Cumulative distribution function
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 20], ["Bob", "B", 30], ["Cathy", "B", 40], ["Dave", "B", 40], ["Eric", "B", 60]], ["name", "group", "age"])
df.show()
+-----+-----+---+
| name|group|age|
+-----+-----+---+
| Alex| A| 20|
| Bob| B| 30|
|Cathy| B| 40|
| Dave| B| 40|
| Eric| B| 60|
+-----+-----+---+
To get the cumulative distribution of age of each group:
window = Window.partitionBy("group").orderBy("age")
df.withColumn("CUMULATIVE DIS", F.cume_dist().over(window)).show()
+-----+-----+---+--------------+
| name|group|age|CUMULATIVE DIS|
+-----+-----+---+--------------+
| Alex| A| 20| 1.0|
| Bob| B| 30| 0.25|
|Cathy| B| 40| 0.75|
| Dave| B| 40| 0.75|
| Eric| B| 60| 1.0|
+-----+-----+---+--------------+
Here, Cathy and Dave have a CUMULATIVE DIS value of 0.75 because their age value is equal
to or greater than 75% of the age values within that group.
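The cume_dist(~) value for each row is the fraction of rows in its group whose value is less than or equal to that row's value. A plain-Python sketch of this definition (not the Spark API):

```python
def cume_dist_within_group(values):
    # For each value v, the fraction of group members u with u <= v.
    n = len(values)
    return [sum(1 for u in values if u <= v) / n for v in values]

# Group B ages: Bob 30, Cathy 40, Dave 40, Eric 60
assert cume_dist_within_group([30, 40, 40, 60]) == [0.25, 0.75, 0.75, 1.0]
```

This reproduces the column above: 3 of the 4 ages in group B are at most 40, so Cathy and Dave both get 3/4 = 0.75.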
Specifying range using rangeBetween
We can use the rangeBetween(~) method to only consider rows whose specified column
value is within a given range. For example, consider the following DataFrame:
df = spark.createDataFrame([["Alex", "A", 15], ["Bob", "A", 20], ["Cathy", "A", 30], ["Dave", "A", 30], ["Eric", "B", 30]], ["Name", "Group", "Age"])
df.show()
+-----+-----+---+
| Name|Group|Age|
+-----+-----+---+
| Alex| A| 15|
| Bob| A| 20|
|Cathy| A| 30|
| Dave| A| 30|
| Eric| B| 30|
+-----+-----+---+
To compute a moving average of Age with rows whose Age value satisfies some range
condition:
window = Window.partitionBy("Group").orderBy("Age").rangeBetween(start=-5, end=10)
df.withColumn("AVG", F.avg(F.col("Age")).over(window)).show()
+-----+-----+---+-----+
| Name|Group|Age| AVG|
+-----+-----+---+-----+
| Alex| A| 15| 17.5|
| Bob| A| 20|23.75|
|Cathy| A| 30| 30.0|
| Dave| A| 30| 30.0|
| Eric| B| 30| 30.0|
+-----+-----+---+-----+
In the beginning, the first row with Age=15 is selected and we scan for rows where
the Age value is between 15-5=10 and 15+10=25. Since Bob's row satisfies this condition,
the aggregate function (averaging in this case) takes in as input Alex's row (the current row)
and Bob's row:
Here:
the blue row indicates the current row.
the red row represents a row that satisfies the range condition.
Next, the second row with Age=20 is selected. Similarly, we scan for rows where the Age is
between 20-5=15 and 20+10=30 and compute the aggregate function based on the satisfied
rows:
Here, 23.75 is the average of 15, 20, 30 and 30. Note that Eric's row is not included in the
calculation even though his Age is 30 because he belongs to a different group.
As one last example, here's what would happen for the next row:
Once we repeat this process for the rest of the rows and all other groups, we end up with:
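The whole rangeBetween procedure can be condensed into a plain-Python sketch over one group's ordered ages (an intuition aid, not the Spark API): for each row, average every age lying within [age+start, age+end].

```python
def range_between_avg(ordered_ages, start, end):
    # Sketch of avg(~) over rangeBetween(start, end) within a single group.
    out = []
    for age in ordered_ages:
        in_range = [a for a in ordered_ages if age + start <= a <= age + end]
        out.append(sum(in_range) / len(in_range))
    return out

# Group A ages with start=-5, end=10 reproduce the AVG column above:
assert range_between_avg([15, 20, 30, 30], -5, 10) == [17.5, 23.75, 30.0, 30.0]
```

Eric never enters these averages because the window is evaluated per group, matching the note about him belonging to group B.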
Specifying rows using rowBetween
We can use the rowsBetween(~) method to specify how many preceding and subsequent
rows we wish to consider when computing our aggregate function. For example, consider
the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 10], ["Bob", "A", 20], ["Cathy", "A", 30], ["Dave", "A", 40], ["Eric", "B", 50]], ["Name", "Group", "Age"])
df.show()
+-----+-----+---+
| Name|Group|Age|
+-----+-----+---+
| Alex| A| 10|
| Bob| A| 20|
|Cathy| A| 30|
| Dave| A| 40|
| Eric| B| 50|
+-----+-----+---+
To use 1 preceding row and 2 subsequent rows in the calculation of our aggregate function:
window = Window.partitionBy("Group").orderBy("Age").rowsBetween(start=-1, end=2)
df.withColumn("AVG", F.avg(F.col("Age")).over(window)).show()
+-----+-----+---+----+
| Name|Group|Age| AVG|
+-----+-----+---+----+
| Alex| A| 10|20.0|
| Bob| A| 20|25.0|
|Cathy| A| 30|30.0|
| Dave| A| 40|35.0|
| Eric| B| 50|50.0|
+-----+-----+---+----+
Here, note the following:
Alex's row has no preceding row but has 2 subsequent rows (Bob and Cathy's row). This
means that Alex's AVG value is 20 because (10+20+30)/3=20.
Bob's row has one preceding row and 2 subsequent rows. This means that Bob's AVG value
is 25 because (10+20+30+40)/4=25.
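Unlike rangeBetween, rowsBetween counts positions rather than values. A plain-Python sketch over one group's ordered ages (not the Spark API):

```python
def rows_between_avg(ordered_ages, start, end):
    # Sketch of avg(~) over rowsBetween(start, end) within a single group:
    # for each row, average the slice from `start` rows before it to `end`
    # rows after it (clamped at the group boundaries).
    out = []
    for i in range(len(ordered_ages)):
        window = ordered_ages[max(0, i + start): i + end + 1]
        out.append(sum(window) / len(window))
    return out

# Group A ages with 1 preceding and 2 subsequent rows reproduce AVG above:
assert rows_between_avg([10, 20, 30, 40], -1, 2) == [20.0, 25.0, 30.0, 35.0]
```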
Using window functions to preserve ordering with collect_list
Window functions can also be used to preserve ordering when performing
a collect_list(~) operation. The conventional way of calling collect_list(~) is with groupBy(~).
For example, consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", "A", 2], ["Bob", "A", 1], ["Cathy", "B", 1], ["Doge", "A", 3]], ["name", "my_group", "rank"])
df.show()
+-----+--------+----+
| name|my_group|rank|
+-----+--------+----+
| Alex| A| 2|
| Bob| A| 1|
|Cathy| B| 1|
| Doge| A| 3|
+-----+--------+----+
To collect all the names for each group in my_group as a list:
df_result = df.groupBy("my_group").agg(F.collect_list("name").alias("name"))
df_result.show()
+--------+-----------------+
|my_group| name|
+--------+-----------------+
| A|[Alex, Bob, Doge]|
| B| [Cathy]|
+--------+-----------------+
This solution is acceptable only in the case when the ordering of the elements in the
collected list does not matter. In this particular case, we get the order [Alex, Bob, Doge] but
there is no guarantee that this will always be the output every time. This is because
the groupBy(~) operation shuffles the data across the worker nodes, and then Spark
appends values to the list in a non-deterministic order.
In the case when the ordering of the elements in the list matters, we can
use collect_list(~) over a window partition like so:
w = Window.partitionBy("my_group").orderBy("rank")
df_result = df.withColumn("result", F.collect_list("name").over(w))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result.show()
+--------+-----------------+
|my_group| result|
+--------+-----------------+
| A|[Bob, Alex, Doge]|
| B| [Cathy]|
+--------+-----------------+
Here, we've first defined a window partition based on my_group, which is ordered by rank.
We then directly use the collect_list(~) over this window partition to generate the following
intermediate result:
df_result.show()
+-----+--------+----+-----------------+
| name|my_group|rank| result|
+-----+--------+----+-----------------+
| Bob| A| 1| [Bob]|
| Alex| A| 2| [Bob, Alex]|
| Doge| A| 3|[Bob, Alex, Doge]|
|Cathy| B| 1| [Cathy]|
+-----+--------+----+-----------------+
Remember, window partitions do not aggregate values; that is, the number of rows in the
resulting DataFrame remains the same.
Finally, we group by my_group and fetch the row with the longest list for each group
using F.max(~) to obtain the desired output.
Note that we could also add a filtering condition for collect_list(~) like so:
w = Window.partitionBy("my_group").orderBy("rank")
df_result = df.withColumn("result", F.collect_list(F.when(F.col("name") != "Alex", F.col("name"))).over(w))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result = df_result.groupBy("my_group").agg(F.max("result").alias("result"))
df_final_result.show()
+--------+-----------+
|my_group| result|
+--------+-----------+
| A|[Bob, Doge]|
| B| [Cathy]|
+--------+-----------+
Here, we are collecting names as a list for each group while filtering out the name Alex.
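The deterministic result produced by the window approach can be mimicked in plain Python: sort the rows by rank first, then append names group by group (a toy sketch, not the Spark API):

```python
from collections import defaultdict

# Sorting by rank before grouping makes the collected order deterministic.
rows = [("Alex", "A", 2), ("Bob", "A", 1), ("Cathy", "B", 1), ("Doge", "A", 3)]

collected = defaultdict(list)
for name, group, rank in sorted(rows, key=lambda r: r[2]):  # order by rank
    collected[group].append(name)
```

This yields [Bob, Alex, Doge] for group A every time, whereas a shuffled groupBy(~) gives no such guarantee.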
PySpark SQL Functions | mean method
PySpark SQL Functions' mean(~) method returns the mean value in the specified column.
Parameters
1. col | string or Column
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To compute the mean of the column age:
import pyspark.sql.functions as F
df.select(F.mean("age")).show()
+--------+
|avg(age)|
+--------+
| 27.5|
+--------+
list_rows = df.select(F.mean("age")).collect()
list_rows[0][0]
27.5
Here, we are converting the PySpark DataFrame returned by select(~) into a list of Row objects
using the collect() method. This list is guaranteed to be of size one because mean(~) reduces
the column values to a single number. To access the content of the Row object, we use another [0].
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
To compute the mean age of each class:
df.groupBy("class").agg(F.mean("age").alias("MEAN AGE")).show()
+-----+--------+
|class|MEAN AGE|
+-----+--------+
| A| 35.0|
| B| 30.0|
+-----+--------+
Here, we are using alias("MEAN AGE") to assign a label to the aggregated age column.
Using SQL against a PySpark DataFrame
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To register this DataFrame as a SQL table called users:
df.createOrReplaceTempView("users")
Here, we have registered the DataFrame as a SQL table called users. The temporary table will be
dropped whenever the Spark session ends. On the other hand, a view created
with createGlobalTempView(~) is shared across Spark sessions, and is only dropped when the
Spark application ends.
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
WARNING
Only read-only SQL statements are allowed - data manipulation language (DML) statements such
as UPDATE and DELETE are not supported since PySpark has no notion of transactions.
query = "SELECT * FROM users"
df_res = spark.sql(query)
df_res.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark SQL Functions' min(~) method returns the minimum value in the specified column.
Parameters
1. col | string or Column
The column to compute the minimum of.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.min("age")).show()
+--------+
|min(age)|
+--------+
| 25|
+--------+
list_rows = df.select(F.min("age")).collect()
list_rows[0][0]
25
Here, the collect() method converts the PySpark DataFrame returned by select(~) into a list
of Row objects. This list will always be of length one because min(~) reduces the column
values to a single number.
df.show()
+-----+---+-----+
| name|age|class|
+-----+---+-----+
| Alex| 20| A|
| Bob| 30| B|
|Cathy| 50| A|
+-----+---+-----+
df.groupby("class").agg(F.min("age").alias("MIN AGE")).show()
+-----+-------+
|class|MIN AGE|
+-----+-------+
| A| 20|
| B| 30|
+-----+-------+
Here, the alias(~) method is used to assign a label to the aggregated age column.
PySpark SQL Functions' month(~) method extracts the month component of each column value,
which can be of type string or date.
Parameters
1. col | string or Column
The date column from which to extract the month.
Return Value
A Column object of integers.
Examples
Consider the following PySpark DataFrame with some datetime values:
import datetime
df = spark.createDataFrame([["Alex", datetime.date(1995, 12, 16)], ["Bob", datetime.date(1995, 5, 9)]], ["name", "birthday"])
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
import pyspark.sql.functions as F
df.select(F.month("birthday").alias("month")).show()
+-----+
|month|
+-----+
| 12|
| 5|
+-----+
Here, we are assigning the name "month" to the Column object returned by month(~) .
df.select(F.month("birthday").alias("day")).show()
+---+
|day|
+---+
| 12|
|  5|
+---+
PySpark SQL Functions' regexp_extract(~) method extracts a substring using regular expression.
Parameters
1. str | string or Column
The column whose values will be matched.
2. pattern | string
The regular expression pattern.
3. idx | int
The group from which to extract values. Consult the examples below for clarification.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["id_20_30", 10], ["id_40_50", 30]], ["id", "age"])
df.show()
+--------+---+
| id|age|
+--------+---+
|id_20_30| 10|
|id_40_50| 30|
+--------+---+
To extract the first number in each id value, use regexp_extract(~) like so:
import pyspark.sql.functions as F
df.select(F.regexp_extract("id", r"(\d+)", 1)).show()
+----------------------------+
|regexp_extract(id, (\d+), 1)|
+----------------------------+
| 20|
| 40|
+----------------------------+
Here, the regular expression (\d+) matches one or more digits ( 20 and 40 in this case). We set the
third argument value as 1 to indicate that we are interested in extracting the first matched group -
this argument is useful when we capture multiple groups.
We can use multiple capture groups with regexp_extract(~) like so:
df.select(F.regexp_extract("id", r"(\d+)_(\d+)", 2)).show()
+----------------------------------+
|regexp_extract(id, (\d+)_(\d+), 2)|
+----------------------------------+
| 30|
| 50|
+----------------------------------+
Here, we set the third argument value to 2 to indicate that we are interested in extracting the
values captured by the second group.
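The group-index semantics carry over directly to Python's re module — a pure-Python sketch of regexp_extract, where group(idx) picks the idx-th capture group (Spark returns an empty string when there is no match, which the hypothetical helper below mimics):

```python
import re

def regexp_extract(value, pattern, idx):
    # Find the pattern and return the requested capture group;
    # mirror Spark's behavior of returning "" when nothing matches.
    m = re.search(pattern, value)
    return m.group(idx) if m else ""

print(regexp_extract("id_20_30", r"(\d+)_(\d+)", 1))  # 20
print(regexp_extract("id_40_50", r"(\d+)_(\d+)", 2))  # 50
```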
RELATED
PySpark SQL Functions' regexp_replace(~) method replaces the matched regular expression with the
specified string.
PySpark SQL Functions | regexp_replace method
PySpark SQL Functions' regexp_replace(~) method replaces the matched regular expression with
the specified string.
Parameters
1. str | string or Column
The column whose values will be replaced.
2. pattern | string
The regular expression to match.
3. replacement | string
The string with which to replace the matched patterns.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 10|
|Mile| 30|
+----+---+
To replace the substring 'le' with 'LE':
import pyspark.sql.functions as F
df.select(F.regexp_replace("name", "le", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| ALEx|
| MiLE|
+--------+
NOTE
The second argument is a regular expression, so characters such as $ and [ will carry special
meaning. In order to treat these special characters as literal characters, escape them using
the \ character (e.g. \$).
Passing in a Column object
Instead of referring to the column by its name, we can also pass in a Column object:
df.select(F.regexp_replace(F.col("name"), "le", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| ALEx|
| MiLE|
+--------+
We can use the PySpark DataFrame's withColumn(~) method to obtain a new PySpark DataFrame
with the updated column like so:
df.withColumn("name", F.regexp_replace("name", "le", "LE")).show()
+----+---+
|name|age|
+----+---+
|ALEx| 10|
|MiLE| 30|
+----+---+
To replace the substring 'le' with 'LE' only when it occurs at the end, use regexp_replace(~):
df.select(F.regexp_replace("name", "le$", "LE").alias("new_name")).show()
+--------+
|new_name|
+--------+
| Alex|
| MiLE|
+--------+
Here, we are using the special regular expression character '$' that only matches patterns
occurring at the end of the string. This is the reason no replacement was done for
the 'le' in Alex.
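The same anchoring behavior can be reproduced with Python's re.sub — a pure-Python analogue of regexp_replace, shown here only to illustrate the $ end-of-string semantics:

```python
import re

names = ["Alex", "Mile"]

# 'le$' matches 'le' only at the end of the string, so 'Alex'
# (where 'le' sits in the middle) is left untouched.
replaced = [re.sub(r"le$", "LE", n) for n in names]
print(replaced)  # ['Alex', 'MiLE']
```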
RELATED
PySpark SQL Functions' regexp_extract(~) method extracts a substring using regular expression.
PySpark SQL Functions' repeat(~) method returns a new column with each string value repeated n times.
Parameters
1. col | string or Column
The column to repeat.
2. n | int
The number of times to repeat each value.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.repeat("name", 2)).show()
+---------------+
|repeat(name, 2)|
+---------------+
| AlexAlex|
| BobBob|
+---------------+
Note that we could also supply a Column object to repeat(~) like so:
import pyspark.sql.functions as F
df.select(F.repeat(df.name, 2)).show()
+---------------+
|repeat(name, 2)|
+---------------+
| AlexAlex|
| BobBob|
+---------------+
PySpark SQL Functions | round method
PySpark SQL Functions' round(~) method rounds the values of the specified column.
Parameters
1. col | string or Column
The column to round.
2. scale | int | optional
If scale is positive, such as scale=2, then values are rounded to 2 decimal places. If scale is
negative, such as scale=-1, then values are rounded to the nearest ten. By default, scale=0, that is,
values are rounded to the nearest integer.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+------+
| name|salary|
+-----+------+
| Alex| 90.4|
| Bob| 100.5|
|Cathy|100.63|
+-----+------+
import pyspark.sql.functions as F
df.select(F.round("salary")).show()
+----------------+
|round(salary, 0)|
+----------------+
| 90.0|
| 101.0|
| 101.0|
+----------------+
df.select(F.round("salary", 1)).show()
+----------------+
|round(salary, 1)|
+----------------+
| 90.4|
| 100.5|
| 100.6|
+----------------+
df.select(F.round("salary", -1)).show()
+-----------------+
|round(salary, -1)|
+-----------------+
| 90.0|
| 100.0|
| 100.0|
+-----------------+
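The scale semantics can be sketched in plain Python. Note one assumption worth flagging: Spark's round(~) uses HALF_UP rounding, while Python's built-in round() uses banker's rounding, so the decimal module is the closer analogue here:

```python
from decimal import Decimal, ROUND_HALF_UP

def spark_round(x, scale=0):
    # scale=1 -> exponent -1 (one decimal place);
    # scale=-1 -> exponent +1 (nearest ten), like Spark's scale.
    exp = Decimal(1).scaleb(-scale)
    return float(Decimal(str(x)).quantize(exp, rounding=ROUND_HALF_UP))

print(spark_round(100.5))       # 101.0  (HALF_UP, unlike Python's round())
print(spark_round(100.63, 1))   # 100.6
print(spark_round(100.63, -1))  # 100.0
```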
PySpark SQL Functions' split(~) method returns a new PySpark column of arrays containing
the split tokens based on the specified delimiter.
Parameters
1. str | string or Column
The column of strings to split.
2. pattern | string
The delimiter (a regular expression) by which to split.
3. limit | int | optional
If limit > 0, then the resulting array of split tokens will contain at most limit tokens.
By default, limit=-1.
Return Value
A new PySpark column.
Examples
Consider the following PySpark DataFrame:
df.show()
+-------+
| x|
+-------+
| A#A|
| B##B|
|#C#C#C#|
| null|
+-------+
To split each string by the delimiter #:
import pyspark.sql.functions as F
df.select(F.split("x", "#")).show()
+---------------+
|split(x, #, -1)|
+---------------+
| [A, A]|
| [B, , B]|
| [, C, C, C, ]|
| null|
+---------------+
We can also specify the maximum number of splits to perform using the optional parameter limit :
df.select(F.split("x", "#", 2)).show()
+--------------+
|split(x, #, 2)|
+--------------+
| [A, A]|
| [B, #B]|
| [, C#C#C#]|
| null|
+--------------+
Here, the array containing the split tokens can be of length at most 2. This is the reason why we
still see our delimiter substring "#" in there.
df.show()
+----+
| x|
+----+
| A#A|
| B@B|
|C#@C|
+----+
To split by either the characters # or @ , we can use a regular expression as the delimiter:
df.select(F.split("x", "[#@]")).show()
+------------------+
|split(x, [#@], -1)|
+------------------+
| [A, A]|
| [B, B]|
| [C, , C]|
+------------------+
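Python's re.split offers a close analogue — a sketch of the split semantics, with one mapping to keep in mind: Spark's limit counts tokens, while re.split's maxsplit counts splits, so maxsplit = limit - 1 when limit > 0:

```python
import re

def split(value, pattern, limit=-1):
    # maxsplit=0 means "no limit" in re.split
    maxsplit = limit - 1 if limit > 0 else 0
    return re.split(pattern, value, maxsplit=maxsplit)

print(split("#C#C#C#", "#"))   # ['', 'C', 'C', 'C', '']
print(split("B##B", "#", 2))   # ['B', '#B']  (at most 2 tokens)
print(split("C#@C", "[#@]"))   # ['C', '', 'C']  (regex delimiter)
```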
PySpark SQL Functions' to_date() method converts date strings to date types.
Parameters
1. col | Column
The column of date strings to convert.
2. format | string | optional
The format of the date strings.
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some date strings:
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
To convert date strings in the birthday column to actual date type, use to_date(~) and specify the
pattern of the date string:
import pyspark.sql.functions as F
df_new = df.withColumn("birthday", F.to_date(df["birthday"], "yyyy-MM-dd"))
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- birthday: date (nullable = true)
Here, the withColumn(~) method is used to update the birthday column using the new column
returned by to_date(~) .
As another example, here's a PySpark DataFrame with slightly more complicated date strings:
filter_none
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
Here, our date strings also contain hours, minutes and seconds.
df_new.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1998-05-06|
+----+----------+
PySpark SQL Functions' translate(~) method replaces the specified characters by the desired
characters.
Parameters
1. srcCol | string or Column
The column to perform the operation on.
2. matching | string
The characters to replace.
3. replace | string
The replacement characters.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To perform the following character replacements:
A -> #
e -> @
o -> %
import pyspark.sql.functions as F
df.select(F.translate("name", "Aeo", "#@%")).show()
+-------------------------+
|translate(name, Aeo, #@%)|
+-------------------------+
| #l@x|
| B%b|
| Cathy|
+-------------------------+
Note that we can obtain a new PySpark DataFrame with the translated column using
the withColumn(~) method:
df_new = df.withColumn("name", F.translate("name", "Aeo", "#@%"))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| #l@x| 20|
| B%b| 30|
|Cathy| 40|
+-----+---+
Finally, note that specifying fewer characters for the replace parameter results in the removal of
the corresponding characters in matching:
df.select(F.translate("name", "Aeo", "#")).show()
+-----------------------+
|translate(name, Aeo, #)|
+-----------------------+
| #lx|
| Bb|
| Cathy|
+-----------------------+
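The same character-mapping behavior, including the deletion of characters with no replacement, can be sketched with str.translate — a pure-Python analogue, not Spark's implementation:

```python
def translate(value, matching, replace):
    # Map each matching character to its replacement; characters
    # beyond the length of `replace` map to None, i.e. are deleted.
    table = {ord(m): (replace[i] if i < len(replace) else None)
             for i, m in enumerate(matching)}
    return value.translate(table)

print(translate("Alex", "Aeo", "#@%"))  # #l@x
print(translate("Bob", "Aeo", "#"))     # Bb  ('o' is deleted)
```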
PySpark SQL Functions' trim(~) method returns a new PySpark column with the string values
trimmed, that is, with the leading and trailing spaces removed.
Parameters
1. col | string
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[" Alex ", 20], [" Bob", 30], ["Cathy ", 40]], ["name", "age"])
df.show()
+---------+---+
| name|age|
+---------+---+
| Alex | 20|
| Bob| 30|
|Cathy | 40|
+---------+---+
Here, the values in the name column have leading and trailing spaces.
Trimming columns in PySpark
To trim the name column, that is, to remove the leading and trailing spaces:
import pyspark.sql.functions as F
df.select(F.trim("name").alias("trimmed_name")).show()
+------------+
|trimmed_name|
+------------+
| Alex|
| Bob|
| Cathy|
+------------+
Here, the alias(~) method is used to assign a label to the Column returned by trim(~) .
To get the original PySpark DataFrame but with the name column updated with the trimmed
version, use the withColumn(~) method:
df.withColumn("name", F.trim("name")).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark SQL Functions' upper(~) method returns a new PySpark Column with the specified column
upper-cased.
Parameters
1. col | string or Column
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|alex| 25|
| bOb| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.upper(df.name)).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
import pyspark.sql.functions as F
df.select(F.upper("name")).show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
+-----------+
To replace the name column with the upper-cased version, use the withColumn(~) method:
import pyspark.sql.functions as F
df.withColumn("name", F.upper("name")).show()
+----+---+
|name|age|
+----+---+
|ALEX| 25|
| BOB| 30|
+----+---+
PySpark SQL Functions' when(~) method is used to update values of a PySpark DataFrame column
to other values based on the given conditions.
NOTE
The when(~) method is often used in conjunction with the otherwise(~) method to implement an if-
else logic. See examples below for clarification.
Parameters
1. condition | Column
The boolean condition.
2. value
The value to assign if the condition is met.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
Notice how we used the method otherwise(~) to set values for cases when the conditions are not
met.
Note that if you do not include the otherwise(~) method, then any value that does not fulfil the if
condition will be assigned null :
df.select(F.when(df.name == "Alex", "Doge")).show()
+-------------------------------------+
|CASE WHEN (name = Alex) THEN Doge END|
+-------------------------------------+
| Doge|
| null|
| null|
+-------------------------------------+
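The if-else semantics of when(~)/otherwise(~) reduce to a per-row conditional expression — a pure-Python sketch, where None plays the role of Spark's null:

```python
names = ["Alex", "Bob", "Cathy"]

# with otherwise(~): unmatched rows get the default value
with_otherwise = ["Doge" if n == "Alex" else "Eric" for n in names]

# without otherwise(~): unmatched rows become None (Spark's null)
without_otherwise = ["Doge" if n == "Alex" else None for n in names]

print(with_otherwise)     # ['Doge', 'Eric', 'Eric']
print(without_otherwise)  # ['Doge', None, None]
```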
Specifying multiple conditions
Using the pipe and ampersand operators
We can combine conditions using & (and) and | (or) like so:
+----+---+
|name|age|
+----+---+
|Doge| 20|
|Eric| 24|
|Eric| 22|
+----+---+
df.select(F.when(df.name == "Alex", "Doge")
          .when(df.name == "Bob", "Zebra")
          .otherwise("Eric")).show()
+----------------------------------------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge WHEN (name = Bob) THEN Zebra ELSE Eric END|
+----------------------------------------------------------------------------+
| Doge|
| Zebra|
| Eric|
+----------------------------------------------------------------------------+
import pyspark.sql.functions as F
df.select(F.when(df.age > 15, df.age + 30)).show()
+----------------------------------------+
|CASE WHEN (age > 15) THEN (age + 30) END|
+----------------------------------------+
| 50|
| 54|
| 52|
+----------------------------------------+
Using an alias
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric").alias("new_name")).show()
+--------+
|new_name|
+--------+
| Doge|
| Eric|
| Eric|
+--------+
PySpark DataFrame's rdd property returns the RDD representation of the DataFrame. Keep in
mind that PySpark DataFrames are internally represented as RDDs.
Return Value
RDD containing Row objects.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To convert our PySpark DataFrame into a RDD, use the rdd property:
rdd = df.rdd
rdd.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
Here, we are using the collect() method to see the content of our RDD, which is a list
of Row objects.
PySpark DataFrame | alias method
PySpark DataFrame's alias(~) method gives an alias to the DataFrame that you can then refer to in
string statements.
Parameters
This method does not take any parameters.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Let's give an alias to our DataFrame, and then refer to the DataFrame using the alias:
df = df.alias("my_df")
df.select("my_df.name").show()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
PySpark DataFrame's coalesce(~) method reduces the number of partitions of the PySpark
DataFrame without shuffling.
Parameters
1. num_partitions | int
The number of partitions to reduce to.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
The default number of partitions is governed by your PySpark configuration. In my case, the
default number of partitions is:
df.rdd.getNumPartitions()
8
We can see the actual content of each partition of the PySpark DataFrame by using the underlying
RDD's glom() method:
df.rdd.glom().collect()
[[],
[],
[Row(name='Alex', age=20)],
[],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)]]
We can see that we indeed have 8 partitions, 3 of which contain a Row .
To reduce the number of partitions of the DataFrame without shuffling, use coalesce(~):
df_new = df.coalesce(2)
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20)],
 [Row(name='Bob', age=30), Row(name='Cathy', age=40)]]
NOTE
Both the methods repartition(~) and coalesce(~) are used to change the number of partitions,
but here is a notable difference: repartition(~) generally results in a shuffling operation
while coalesce(~) does not. This means that coalesce(~) is less costly than repartition(~)
because the data does not have to travel across the worker nodes as much.
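The no-shuffle property can be sketched in plain Python: each new partition is just a concatenation of whole existing partitions, so no individual row is redistributed. This is only an illustrative sketch — Spark's actual partition-grouping strategy differs:

```python
def coalesce(partitions, num_partitions):
    # Assign whole existing partitions to the new, smaller set of
    # partitions; rows never leave the partition they started in.
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)
    return merged

# The 8 partitions from the example above (tuples stand in for Rows)
eight = [[], [], [("Alex", 20)], [], [], [("Bob", 30)], [], [("Cathy", 40)]]
print(coalesce(eight, 2))
```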
PySpark DataFrame's collect() method returns all the records of the DataFrame as a list
of Row objects.
Return Value
A list of Row objects.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
WARNING
Under the hood, the collect() method sends all the data scattered across the worker nodes to
the driver node. This means that if the size of the data is large, the driver program may
run out of memory and throw an error.
PySpark DataFrame's colRegex(~) method returns the Column objects whose labels match the specified
regular expression. This method also allows multiple columns to be selected.
Parameters
1. colName | string
Return Value
A PySpark Column.
Examples
Selecting columns using regular expression in PySpark
df.show()
+-----+----+
| col1|col2|
+-----+----+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+----+
df.select(df.colRegex("`col[123]`")).show()
+-----+----+
| col1|col2|
+-----+----+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+----+
Here, the regular expression col[123] matches columns with label col1, col2 or col3, and
the select(~) method is used to convert the Column objects into a PySpark DataFrame.
Getting column labels that match regular expression as list of strings in PySpark
df.select(df.colRegex("`col[123]`")).columns
['col1', 'col2']
Here, we are using the columns property of the PySpark DataFrame returned by select(~) .
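The label-matching step can be sketched with Python's re module — a pure-Python analogue of selecting columns whose names fully match a pattern:

```python
import re

columns = ["col1", "col2", "name"]

# Keep only labels that fully match the regular expression
matched = [c for c in columns if re.fullmatch(r"col[123]", c)]
print(matched)  # ['col1', 'col2']
```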
PySpark DataFrame's corr(~) method returns the correlation of the specified numeric columns
as a float.
Parameters
1. col1 | string
The first column.
2. col2 | string
The second column.
3. method | string | optional
The type of correlation to compute. The only correlation type supported currently is
the Pearson Correlation Coefficient.
Return Value
A float.
Examples
df = spark.createDataFrame([("Alex", 180, 80), ("Bob", 170, 70), ("Cathy", 160, 70)], ["name", "height", "weight"])
df.show()
+-----+------+------+
| name|height|weight|
+-----+------+------+
| Alex|   180|    80|
|  Bob|   170|    70|
|Cathy|   160|    70|
+-----+------+------+
df.corr("height","weight")
0.8660254037844387
Here, we see that the height and weight are positively correlated with a Pearson correlation
coefficient of around 0.87.
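The returned value can be reproduced by computing the Pearson correlation coefficient by hand — a pure-Python check of what corr(~) computes on the example data:

```python
import math

heights = [180, 170, 160]
weights = [80, 70, 70]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Pearson r = sum of deviation products / (product of deviation norms)
num = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
den = (math.sqrt(sum((h - mean_h) ** 2 for h in heights))
       * math.sqrt(sum((w - mean_w) ** 2 for w in weights)))
r = num / den
print(round(r, 10))  # 0.8660254038
```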
RELATED
PySpark DataFrame's cov(~) method returns the covariance of two specified numeric columns as a
float.
PySpark DataFrame's count(~) method returns the number of rows of the DataFrame.
Parameters
This method does not take in any parameters.
Return Value
An integer.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df.count()
3
PySpark DataFrame's cov(~) method returns the covariance of two specified numeric columns as a
float.
Parameters
1. col1 | string
2. col2 | string
Return Value
A float .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 180, 80), ("Bob", 170, 70), ("Cathy", 160, 70)], ["name", "height", "weight"])
df.show()
+-----+------+------+
| name|height|weight|
+-----+------+------+
| Alex|   180|    80|
|  Bob|   170|    70|
|Cathy|   160|    70|
+-----+------+------+
Computing the covariance of two numeric PySpark columns
df.cov("height","weight")
50.0
Here, we see that the covariance between height and weight is 50, indicating a positive relationship.
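The returned value matches the sample covariance (dividing by n - 1) — a pure-Python check on the example data:

```python
heights = [180, 170, 160]
weights = [80, 70, 70]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Sample covariance: sum of deviation products divided by n - 1
cov = sum((h - mean_h) * (w - mean_w)
          for h, w in zip(heights, weights)) / (n - 1)
print(cov)  # 50.0
```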
RELATED
PySpark DataFrame's corr(~) method returns the correlation of the specified numeric columns as a
float.
PySpark DataFrame | describe method
PySpark DataFrame's describe(~) method returns a new PySpark DataFrame holding summary
statistics of the specified columns.
Parameters
1. *cols | string | optional
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 25|
| Bob| 30|
+----+---+
df.describe("name","age").show()
+-------+----+----+
|summary|name| age|
+-------+----+----+
| count| 3| 3|
| mean|null|25.0|
| stddev|null| 5.0|
| min|Alex| 20|
| max| Bob| 30|
+-------+----+----+
Getting summary statistics of all numeric and string columns in PySpark DataFrame
df.describe().show()
+-------+----+----+
|summary|name| age|
+-------+----+----+
| count| 3| 3|
| mean|null|25.0|
| stddev|null| 5.0|
| min|Alex| 20|
| max| Bob| 30|
+-------+----+----+
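The mean and stddev rows correspond to the sample mean and sample standard deviation of the numeric column — a quick pure-Python check using the standard library:

```python
import statistics

ages = [20, 25, 30]

print(statistics.mean(ages))   # 25
print(statistics.stdev(ages))  # 5.0  (sample stddev, dividing by n - 1)
```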
RELATED
PySpark DataFrame's summary(~) method returns a PySpark DataFrame containing basic summary
statistics of numeric columns.
PySpark DataFrame's distinct() method returns a new DataFrame containing distinct rows.
Parameters
This method does not take in any parameters.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Alex", 25], ["Alex", 50]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 25|
|Alex| 50|
+----+---+
To get all distinct rows of a PySpark DataFrame, use the distinct() method:
filter_none
df.distinct().show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
|Alex| 50|
+----+---+
To count the number of distinct rows, chain the count() method:
df.distinct().count()
3
PySpark DataFrame's drop(~) method returns a new DataFrame with the specified columns
dropped.
NOTE
Trying to drop a column that does not exist will not raise an error - the original DataFrame will
be returned instead.
Parameters
1. *cols | string or Column
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25, True], ["Bob", 30, False]], ["name", "age", "is_married"])
df.show()
+----+---+----------+
|name|age|is_married|
+----+---+----------+
|Alex| 25|      true|
| Bob| 30|     false|
+----+---+----------+
df.drop("name").show()
+---+----------+
|age|is_married|
+---+----------+
| 25| true|
| 30| false|
+---+----------+
import pyspark.sql.functions as F
df.drop(F.col("name")).show()
+---+----------+
|age|is_married|
+---+----------+
| 25| true|
| 30| false|
+---+----------+
df.drop("name", "age").show()
+----------+
|is_married|
+----------+
| true|
| false|
+----------+
WARNING
Passing in multiple Column objects, like df.drop(F.col("name"), F.col("age")), will throw an error.
To drop multiple columns, pass in the column labels instead:
cols = ["name", "age"]
df.drop(*cols).show()
+----------+
|is_married|
+----------+
| true|
| false|
+----------+
PySpark DataFrame's dropDuplicates(~) returns a new DataFrame with duplicate rows removed. We
can optionally specify columns to check for duplicates.
NOTE
dropDuplicates(~) is an alias for drop_duplicates(~) .
Parameters
1. subset | string or list of string | optional
The columns by which to check for duplicates. By default, all columns will be checked.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Bob", 30], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
| Bob| 30|
|Cathy| 25|
+-----+---+
df.dropDuplicates().show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 25|
+-----+---+
Here, only the first occurrence is kept while subsequent occurrences are removed.
To remove rows with duplicate values in the age column only:
df.dropDuplicates(["age"]).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Again, only the first occurrence is kept while the latter duplicate rows are discarded.
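The keep-first-occurrence behavior can be sketched in plain Python — a minimal analogue of dropDuplicates(["age"]) using a seen-set:

```python
rows = [("Alex", 25), ("Bob", 30), ("Bob", 30), ("Cathy", 25)]

seen = set()
deduped = []
for name, age in rows:
    if age not in seen:        # key = the "age" column only
        seen.add(age)
        deduped.append((name, age))

print(deduped)  # [('Alex', 25), ('Bob', 30)]
```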
PySpark DataFrame's dropna(~) method returns a new DataFrame with rows containing null values removed.
Parameters
1. how | string | optional
If 'any', drop rows that contain at least one null value; if 'all', drop rows whose values are all null.
By default, how='any'.
2. thresh | int | optional
Drop rows that have fewer non-null values than thresh. Note that this overrides the how parameter.
3. subset | string or list of string | optional
The columns to check for null values. By default, all columns will be checked.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
df.dropna().show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
n_non_missing_vals = 2
df.dropna(thresh=n_non_missing_vals).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
To keep rows with fewer than n_missing_vals missing values:
n_missing_vals = 2
df.dropna(thresh=len(df.columns) - n_missing_vals + 1).show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
df.dropna(how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
df.dropna(subset='age').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (either) in PySpark DataFrame
To drop rows where either the name or age column value is missing:
df.dropna(subset=['name','age'], how='any').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (all) in PySpark DataFrame
To drop rows where the name and age column values are both missing:
df.dropna(subset=['name','age'], how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
|Cathy|null|
+-----+----+
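The how/subset combinations reduce to any/all checks over the selected values — a pure-Python sketch of the dropna semantics, where None stands in for Spark's null:

```python
rows = [{"name": "Alex", "age": 20},
        {"name": None, "age": None},
        {"name": "Cathy", "age": None}]

def dropna(rows, how="any", subset=None):
    cols = subset or ["name", "age"]
    if how == "any":
        # Drop a row if ANY checked value is null
        return [r for r in rows if all(r[c] is not None for c in cols)]
    # how == "all": drop a row only if ALL checked values are null
    return [r for r in rows if any(r[c] is not None for c in cols)]

print(len(dropna(rows)))                   # 1  (only Alex survives)
print(len(dropna(rows, how="all")))        # 2  (Alex and Cathy)
print(len(dropna(rows, subset=["name"])))  # 2  (Alex and Cathy)
```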
PySpark DataFrame's exceptAll(~) method returns a new DataFrame containing the rows that exist
in this DataFrame but not in the other DataFrame, while preserving duplicates.
Parameters
1. other | PySpark DataFrame
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([["Alex", 20], ["Bob", 35], ["Cathy", 40]], ["name", "age"])
df_other.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 35|
|Cathy| 40|
+-----+---+
Getting all rows of PySpark DataFrame that do not exist in another PySpark DataFrame
df.exceptAll(df_other).show()
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
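Because duplicates are preserved, exceptAll behaves like a multiset difference rather than a plain set difference — a pure-Python sketch using collections.Counter:

```python
from collections import Counter

rows = [("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other = [("Alex", 20), ("Bob", 35), ("Cathy", 40)]

# Counter subtraction keeps per-row counts, so a row appearing twice
# here and once in `other` would survive once.
diff = list((Counter(rows) - Counter(other)).elements())
print(diff)  # [('Bob', 30)]
```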
PySpark DataFrame's filter(~) method returns the rows of the DataFrame that satisfies the given
condition.
NOTE
The where(~) method is an alias for filter(~).
Parameters
1. condition | Column or string
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
PySpark DataFrame's foreach(~) method loops over each row of the DataFrame as a Row object and
applies the given function to the row.
WARNING
the foreach(~) method in Spark is invoked in the worker nodes instead of the Driver
program. This means that if we perform a print(~) inside our function, we will not be able
to see the printed results in our session or notebook because the results are printed in the
worker node instead.
rows are read-only and so you cannot update values of the rows.
Given these limitations, the foreach(~) method is mainly used for logging some information about
each row to the local machine or to an external database.
Parameters
1. f | function
Return Value
Nothing is returned.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
def f(row):
    print(row.name)

df.foreach(f)
Here, the row.name is printed in the worker nodes so you would not see any output in the driver
program.
PySpark DataFrame | groupBy method
PySpark DataFrame's groupBy(~) method aggregates rows based on the specified columns. We can
then compute statistics such as the mean for each of these groups.
Parameters
1. cols | list or string or Column | optional
The columns to group by. By default, all rows will be grouped together.
Return Value
The GroupedData object ( pyspark.sql.group.GroupedData ).
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
+-----+----------+---+------+
Basic usage
By default, groupBy() without any arguments will group all rows together, and will compute
statistics for each numeric column:
df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
| 24| 600|
+--------+-----------+
Grouping by a single column and computing statistic of all columns of each group
df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
+----------+--------+-----------+
Instead of referring to the column by its label ( string ), we can also pass a Column object using pyspark.sql.functions.col(~) :
from pyspark.sql import functions as F
df.groupby(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
+----------+--------+-----------+
Grouping by a single column and computing statistic of specific columns of each group
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
Equivalently, we can use the agg(~) method with one of pyspark.sql.functions ' aggregate functions:
df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
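As a plain-Python sketch of what groupBy("department").max("age") computes (the rows here are hypothetical, chosen only to be consistent with the IT/HR outputs shown above):

```python
# Plain-Python sketch of groupBy("department").max("age").
# The rows are hypothetical, chosen to be consistent with the
# IT -> 24 and HR -> 22 outputs shown in this section.
rows = [("Alex", "IT", 24), ("Bob", "IT", 21), ("Cathy", "HR", 22)]

max_age = {}
for name, department, age in rows:
    max_age[department] = max(max_age.get(department, age), age)

print(max_age)  # {'IT': 24, 'HR': 22}
```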
NOTE
The following aggregate functions are supported in PySpark:
agg, avg, count, max, mean, min, pivot, sum
By default, computing the max age of each group will result in the column label max(age) :
df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
To assign the label max_age instead, use the alias(~) method:
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
| IT| 24|
| HR| 22|
+----------+-------+
Computing multiple statistics of each group
import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+--------+--------+-----------------+
|department| max|min(age)| avg(salary)|
+----------+--------+--------+-----------------+
+----------+--------+--------+-----------------+
Grouping by multiple columns
Consider the following PySpark DataFrame:
df.show()
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
+-----+--------+----------+---+------+
To group by position and department , and then computing the max age of each of these groups:
df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
| junior| IT| 24|
+--------+----------+--------+
PySpark DataFrame | head method
PySpark DataFrame's head(~) method returns the first n number of rows as Row objects.
Parameters
1. n | int | optional
The number of rows to return.
Return Value
If n is specified, then a list of Row objects is returned. Otherwise, a single Row object is returned.
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
df.head()
Row(name='Alex', age=15)
df.head(n=2)
[Row(name='Alex', age=15), Row(name='Bob', age=20)]
RELATED
PySpark DataFrame's take(~) method returns the first num number of rows as a list of Row objects.
PySpark DataFrame | intersect method
PySpark DataFrame's intersect(~) method returns a new PySpark DataFrame with rows that exist in
another PySpark DataFrame. Note that unlike intersectAll(~) , intersect(~) only includes duplicate
rows once.
Parameters
1. other | DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([("Alex", 20), ("Doge", 30), ("eric", 40)], ["name", "age"])
df_other.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Doge| 30|
|eric| 40|
+----+---+
To get rows of a PySpark DataFrame that exist in another PySpark DataFrame, use
the intersect(~) method like so:
df_intersect = df.intersect(df_other)
df_intersect.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, we get the row for Alex because it exists in both PySpark DataFrames.
PySpark DataFrame | intersectAll method
PySpark DataFrame's intersectAll(~) method returns a new PySpark DataFrame with rows that also
exist in the other PySpark DataFrame. Unlike intersect(~) , the intersectAll(~) method preserves
duplicates.
Parameters
1. other | DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
df_other = spark.createDataFrame([("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)], ["name", "age"])
df_other.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Alex| 20|
|David| 80|
| Eric| 80|
+-----+---+
To get rows that also exist in other PySpark DataFrame while preserving duplicates:
df_res = df.intersectAll(df_other)
df_res.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Alex| 20|
+----+---+
Alex's row is duplicated because it appears twice in both df and df_other .
If Alex's row appeared only once in one DataFrame but multiple times in the other, then it
would be included just once in the resulting DataFrame.
NOTE
If you want duplicate rows to be included only once, use the intersect(~) method instead.
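The two behaviours can be sketched in plain Python: intersect(~) acts like a set intersection, while intersectAll(~) acts like a multiset intersection that keeps each common row min(count in df, count in df_other) times:

```python
from collections import Counter

# Plain-Python sketch of intersect vs intersectAll semantics,
# using the same rows as the example above.
df_rows = [("Alex", 20), ("Alex", 20), ("Bob", 30), ("Cathy", 40)]
other_rows = [("Alex", 20), ("Alex", 20), ("David", 80), ("Eric", 80)]

# intersect: distinct rows present in both DataFrames
intersect = set(df_rows) & set(other_rows)

# intersectAll: each common row kept min(count_df, count_other) times
counts = Counter(df_rows) & Counter(other_rows)  # multiset intersection
intersect_all = list(counts.elements())

print(intersect)      # {('Alex', 20)}
print(intersect_all)  # [('Alex', 20), ('Alex', 20)]
```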
PySpark DataFrame | join method
schedule AUG 12, 2023
PySpark DataFrame's join(~) method joins two DataFrames using the given join method.
Parameters
1. other | DataFrame
The other PySpark DataFrame to join with.
2. on | string or list or Column | optional
The columns or join condition on which to join.
3. how | string | optional
By default, how="inner" . See examples below for the type of joins implemented.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Performing inner, left and right joins
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+
Inner join
For inner join, all rows that have matching values in both the source and right DataFrame will be
present in the resulting DataFrame:
df1.join(df2, on="name", how="inner").show()
+----+---+------+
|name|age|salary|
+----+---+------+
|Alex| 20|   250|
| Bob| 24|   200|
+----+---+------+
Left join
For left join (or left-outer join), all rows in the left DataFrame and matching rows in the right
DataFrame will be present in the resulting DataFrame:
df1.join(df2, on="name", how="left").show()  # how="left_outer" also works
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 20|   250|
|  Bob| 24|   200|
|Cathy| 22|  null|
+-----+---+------+
Right join
For right join (or right-outer join), all rows in the right DataFrame and matching rows in the left
DataFrame will be present in the resulting DataFrame:
df1.join(df2, on="name", how="right").show()  # how="right_outer" also works
+----+----+------+
|name| age|salary|
+----+----+------+
|Alex|  20|   250|
| Bob|  24|   200|
|Doge|null|   100|
+----+----+------+
Outer join
For outer join, all rows of both the left and right DataFrames will be present:
df1.join(df2, on="name", how="outer").show()  # how="full_outer" also works
+-----+----+------+
| name| age|salary|
+-----+----+------+
| Alex|  20|   250|
|  Bob|  24|   200|
|Cathy|  22|  null|
| Doge|null|   100|
+-----+----+------+
Left anti-join
For left anti-join, all rows in the left DataFrame that are not present in the right DataFrame will
be in the resulting DataFrame:
df1.join(df2, on="name", how="left_anti").show()
+-----+---+
| name|age|
+-----+---+
|Cathy| 22|
+-----+---+
Left semi-join
Left semi-join is the opposite of left-anti join, that is, all rows in the left DataFrame that are
present in the right DataFrame will be in the resulting DataFrame:
df1.join(df2, on="name", how="left_semi").show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 24|
+----+---+
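The join types above can be sketched in plain Python, keyed on name, with None standing in for null:

```python
# Plain-Python sketch of the join semantics above, keyed on name.
df1 = [("Alex", 20), ("Bob", 24), ("Cathy", 22)]
df2 = [("Alex", 250), ("Bob", 200), ("Doge", 100)]
salaries = dict(df2)

inner = [(n, a, salaries[n]) for n, a in df1 if n in salaries]
left = [(n, a, salaries.get(n)) for n, a in df1]          # None plays the role of null
left_anti = [(n, a) for n, a in df1 if n not in salaries]
left_semi = [(n, a) for n, a in df1 if n in salaries]

print(inner)      # [('Alex', 20, 250), ('Bob', 24, 200)]
print(left_anti)  # [('Cathy', 22)]
```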
Up to now, we have specified the join key using the on parameter. Let's now consider the case
when the join keys have different labels. Suppose one DataFrame is as follows:
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["NAME", "salary"])
df2.show()
+----+------+
|NAME|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+
We can join using name of df1 and NAME of df2 like so:
df1.join(df2, df1["name"] == df2["NAME"]).show()
+----+---+----+------+
|name|age|NAME|salary|
+----+---+----+------+
|Alex| 20|Alex|   250|
| Bob| 24| Bob|   200|
+----+---+----+------+
PySpark DataFrame | limit method
PySpark DataFrame's limit(~) method returns a new DataFrame with the number of rows specified.
Parameters
1. num | number
The number of rows to return.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+-----+
| name| age|
+-----+-----+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+-----+
df.limit(2).show()
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
Note that the show(~) method has a parameter that limits the number of rows printed:
df.show(n=2)
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
PySpark DataFrame | orderBy method
PySpark DataFrame's orderBy(~) method returns a new DataFrame that is sorted based on the
specified columns.
Parameters
1. cols | string or list or Column | optional
The columns by which to sort.
2. ascending | boolean or list of boolean | optional
If a list of booleans is passed, then sort will respect this order. For example,
if [True,False] is passed and cols=["colA","colB"] , then the DataFrame will first be sorted in
ascending order of colA , and then in descending order of colB . Note that the second sort
will be relevant only when there are duplicate values in colA .
By default, ascending=True .
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 22, 200], ["Bob", 24, 300], ["Cathy", 22, 100]], ["name", "age", "salary"])
df.show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 22|   200|
|  Bob| 24|   300|
|Cathy| 22|   100|
+-----+---+------+
Sorting PySpark DataFrame by single column in ascending order
df.orderBy("age").show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 22|   200|
|Cathy| 22|   100|
|  Bob| 24|   300|
+-----+---+------+
Sorting PySpark DataFrame by multiple columns in ascending order
df.orderBy(["age","salary"]).show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
|Cathy| 22|   100|
| Alex| 22|   200|
|  Bob| 24|   300|
+-----+---+------+
Sorting PySpark DataFrame by single column in descending order
df.orderBy("age", ascending=False).show()
+-----+---+------+
| name|age|salary|
+-----+---+------+
|  Bob| 24|   300|
| Alex| 22|   200|
|Cathy| 22|   100|
+-----+---+------+
PySpark DataFrame | printSchema method
PySpark DataFrame's printSchema(~) method prints the schema, that is, the columns' name and type
of the DataFrame.
Parameters
This method does not take in any parameters
Return Value
None .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Printing the name and type of each column (schema) in PySpark DataFrame
To obtain the schema, or the name and type of each column of our DataFrame:
df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
PySpark DataFrame | randomSplit method
PySpark DataFrame's randomSplit(~) method randomly splits the PySpark DataFrame into a list of
smaller DataFrames using Bernoulli sampling.
Parameters of randomSplit
1. weights | list of numbers
The list of weights that specify the distribution of the split. For instance, setting [0.8,0.2] will split
the PySpark DataFrame into 2 smaller DataFrames using the following logic:
- a random number is generated between 0 and 1 for each row of the original DataFrame.
- if the random number is between 0 and 0.8, then the row will be placed in the first sub-DataFrame.
- if the random number is between 0.8 and 1.0, then the row will be placed in the second sub-DataFrame.
Under the hood, randomSplit(~) works as follows. Suppose the PySpark DataFrame has two partitions:
- the rows are first locally sorted based on some column value in each partition. This sorting
guarantees that as long as the same rows are in each partition (regardless of their ordering),
we would always end up with the same deterministic ordering.
- the acceptance range of the first split is 0 to 0.8. Any row whose generated random number
is between 0 and 0.8 will be placed in the first split.
- the acceptance range of the second split is 0.8 to 1.0. Any row whose generated random
number is between 0.8 and 1.0 will be placed in the second split.
What's important here is that there is never a guarantee that the first DataFrame will have 80% of
the rows, and the second will have 20%. For instance, suppose the random number generated for
each row falls between 0 and 0.8 - this means that none of the rows will end up in the second
DataFrame split.
On average, we should expect that the first DataFrame will have 80% of the rows while the
second DataFrame with 20% of the rows, but the actual split may be very different.
2. seed | int | optional
The seed for reproducibility. Calling the method with the same seed will always generate the same
splits. There is a caveat to this, as we shall see later.
Return Value
A list of PySpark DataFrames.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
To randomly split this PySpark DataFrame into 2 sub-DataFrames with a 75-25 row split:
for _df in df.randomSplit([0.75, 0.25]):
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
| Bob| 30|
|Dave| 40|
+----+---+
Even though we expect the first DataFrame to contain 3 rows while the second DataFrame to
contain 1 row, we see that split was a 50-50. This is because, as discussed above, randomSplit(~) is
based on Bernoulli sampling.
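A plain-Python sketch of this Bernoulli mechanism, with random.Random standing in for Spark's per-row random draw:

```python
import random

# Plain-Python sketch of randomSplit([0.75, 0.25]): each row draws a
# uniform number in [0, 1) and lands in the split whose acceptance range
# contains it, so the split sizes themselves are random.
rows = ["Alex", "Bob", "Cathy", "Dave"]
rng = random.Random(42)  # a seed makes the draws reproducible

first, second = [], []
for row in rows:
    (first if rng.random() < 0.75 else second).append(row)

print(len(first) + len(second))  # 4 - every row lands in exactly one split
```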
The seed parameter is used for reproducibility. For instance, consider the following PySpark
DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
Running the randomSplit(~) method with the same seed will guarantee the same splits given that the
PySpark DataFrame is partitioned in the exact same way:
for _df in df.randomSplit([0.75, 0.25], seed=24):  # the seed value here is illustrative
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
| Bob| 30|
|Dave| 40|
+----+---+
Running the above multiple times will always yield the same splits since the partitioning of the
PySpark DataFrame is the same.
We can see how the rows of a PySpark DataFrame are partitioned by converting the DataFrame
into a RDD, and then using the glom() method:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40], ["Dave", 40]], ["name", "age"])
df.rdd.glom().collect()
[[],
[Row(name='Alex', age=20)],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)],
[],
[Row(name='Dave', age=40)]]
Here, we see that our PySpark DataFrame is split into 8 partitions but half of them are empty.
df = df.repartition(2)
df.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Bob', age=30),
Row(name='Cathy', age=40),
Row(name='Dave', age=40)],
[]]
Even though the content of the DataFrame is the same, we now only have 2 partitions instead of 8
partitions.
Running randomSplit(~) with the same seed now yields a different split:
for _df in df.randomSplit([0.75, 0.25], seed=24):  # the seed value here is illustrative
    _df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Dave| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
+----+---+
Notice how even though we used the same seed, we ended up with a different split. This confirms
that the seed parameter only guarantees consistent splits only if the underlying partition is the
same. You should be cautious of this behaviour because partitions can change after a shuffle
operation (e.g. join(~) and groupBy(~) ).
PySpark DataFrame | repartition method
PySpark DataFrame's repartition(~) method returns a new PySpark DataFrame with the data split
into the specified number of partitions. This method also allows to partition by column values.
Parameters
1. numPartitions | int
The number of partitions to break the DataFrame into.
2. cols | string or Column | optional
The columns by which to partition.
Return Value
A new PySpark DataFrame.
Examples
Partitioning a PySpark DataFrame
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
By default, the number of partitions depends on the parallelism level of your PySpark
configuration:
df.rdd.getNumPartitions()
8
We can see how the rows of our DataFrame are partitioned using the glom() method of the
underlying RDD:
df.rdd.glom().collect()
[[],
[],
[Row(name='Alex', age=20)],
[],
[],
[Row(name='Bob', age=30)],
[],
[Row(name='Cathy', age=40)]]
Here, we can see that we have indeed 8 partitions, but only 3 of the partitions have a Row in them.
Now, let's repartition our DataFrame such that the Rows are divided into only 2 partitions:
df_new = df.repartition(2)
df_new.rdd.getNumPartitions()
2
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Bob', age=30),
Row(name='Cathy', age=40)],
[]]
As demonstrated here, there is no guarantee that the rows will be evenly distributed in the
partitions.
Partitioning a PySpark DataFrame by column values
Consider the following PySpark DataFrame:
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40), ("Alex", 50)], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
| Alex| 50|
+-----+---+
df_new = df.repartition(2, "name")
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20),
Row(name='Cathy', age=40),
Row(name='Alex', age=50)],
[Row(name='Bob', age=30)]]
Here, notice how the rows with the same value for name ( 'Alex' in this case) end up in the same
partition.
df_new = df.repartition(4, "name", "age")
df_new.rdd.glom().collect()
[[Row(name='Alex', age=20)],
[Row(name='Bob', age=30)],
[Row(name='Alex', age=50)],
[Row(name='Cathy', age=40)]]
Here, we are repartitioning by the name and age columns into 4 partitions.
We can also use the default number of partitions by specifying column labels only:
df_new = df.repartition("name")
df_new.rdd.getNumPartitions()
1
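A plain-Python sketch of partitioning by a column value; the modulo-hash scheme here illustrates the idea and is not Spark's exact hash function:

```python
# Plain-Python sketch of repartitioning by a column: each row goes to
# partition hash(key) % numPartitions, so rows with equal keys always
# share a partition. (Illustrative scheme, not Spark's exact hash.)
rows = [("Alex", 20), ("Bob", 30), ("Cathy", 40), ("Alex", 50)]
num_partitions = 2

partitions = [[] for _ in range(num_partitions)]
for row in rows:
    partitions[hash(row[0]) % num_partitions].append(row)

# both Alex rows necessarily land in the same partition
alex_parts = {i for i, part in enumerate(partitions) for r in part if r[0] == "Alex"}
print(len(alex_parts))  # 1
```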
PySpark DataFrame | replace method
PySpark DataFrame's replace(~) method returns a new DataFrame with certain values replaced. We
can also specify which columns to perform replacement in.
Parameters
1. to_replace | boolean , number , string , list or dict
The values to be replaced.
2. value | boolean , number , string or None | optional
The value to replace to_replace with.
3. subset | list | optional
The columns to focus on. By default, all columns will be checked for replacement.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
To replace the value "Alex" with "ALEX" :
df.replace("Alex", "ALEX").show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.
To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:
df.replace(["Alex", "Bob"], ["ALEX", "BOB"], subset=["name"]).show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
| BOB| 30|
|Cathy| 40|
+-----+---+
Replacing multiple values with a single value
To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:
df.replace(["Alex", "Bob"], "SkyTowner", subset=["name"]).show()
+---------+---+
| name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
| Cathy| 40|
+---------+---+
To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:
df.replace(["Alex","Bob"], "SkyTowner").show ()
+---------+---+
| name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
| Cathy| 40|
+---------+---+
To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:
df.replace({
    "Alex": "ALEX",
    "Bob": "BOB",
}, subset=["name"]).show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+
WARNING
Mixed-type replacements are not allowed. For instance, the following is not allowed:
df.replace({
    "Alex": "ALEX",
    30: 99,
}, subset=["name", "age"]).show()
Here, we are performing one string replacement and one integer replacement. Since this is a
mixed-type replacement, PySpark throws an error. To avoid the error, perform the two replacements
individually.
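A plain-Python sketch of performing the two replacements individually, one pass per value type:

```python
# Plain-Python sketch of why mixed-type replacement needs two passes:
# apply the string mapping and the integer mapping separately, each
# only to values of the matching type.
rows = [{"name": "Alex", "age": 25}, {"name": "Bob", "age": 30}]
str_map = {"Alex": "ALEX"}
int_map = {30: 99}

# pass 1: string replacements only
rows = [{k: str_map.get(v, v) if isinstance(v, str) else v
         for k, v in row.items()} for row in rows]
# pass 2: integer replacements only
rows = [{k: int_map.get(v, v) if isinstance(v, int) else v
         for k, v in row.items()} for row in rows]

print(rows)  # [{'name': 'ALEX', 'age': 25}, {'name': 'Bob', 'age': 99}]
```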
Replacing multiple values in multiple columns
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["aa", "AA"], ["bb", "BB"]], ["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
| aa| AA|
| bb| BB|
+----+----+
df.replace({
    "AA": "@@@",
    "bb": "###",
}, subset=["col1", "col2"]).show()
+----+----+
|col1|col2|
+----+----+
| aa| @@@|
| ###| BB|
+----+----+
PySpark DataFrame | sample method
PySpark DataFrame's sample(~) method returns a random subset of rows of the DataFrame.
Parameters
1. withReplacement | boolean | optional
If True , then sample with replacement, that is, allow for duplicate rows.
If False , then sample without replacement, that is, do not allow for duplicate rows.
By default, withReplacement=False .
2. fraction | float
A number between 0 and 1 , which represents the probability that a value will be included in the
sample. For instance, if fraction=0.5 , then each element will be included in the sample with a
probability of 0.5 .
WARNING
The sample size of the subset will be random since the sampling is performed using Bernoulli
sampling (when withReplacement=False ). This means that even setting fraction=0.5 may result in a sample
without any rows! On average, though, the supplied fraction value will reflect the proportion of rows
returned.
3. seed | int | optional
The seed for reproducibility. By default, no seed will be set, which means that the derived samples
will be random each time.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
To get a random sample in which the probability that an element is included in the sample is 0.5 :
df.sample(fraction=0.5).show ()
+----+---+
|name|age|
+----+---+
|Doge| 22|
+----+---+
Running the code once again may yield a sample of different size:
df.sample(fraction=0.5).show ()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|Cathy| 22|
+-----+---+
This is because the sampling is based on Bernoulli sampling as explained in the beginning.
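A plain-Python sketch of this Bernoulli sampling: each row is kept independently with probability fraction, so the sample size varies from run to run:

```python
import random

# Plain-Python sketch of sample(fraction=0.5) without replacement:
# each row is kept independently with probability `fraction`
# (Bernoulli sampling), so the sample size itself is random.
rows = ["Alex", "Bob", "Cathy", "Doge"]
rng = random.Random(0)

sample = [row for row in rows if rng.random() < 0.5]
print(sample)  # a subset of rows whose size depends on the random draws
```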
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 22|
+-----+---+
To allow for duplicate rows in our sample, set withReplacement=True :
df.sample(fraction=0.5, withReplacement=True).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
| Bob| 24|
| Bob| 24|
|Cathy| 22|
+-----+---+
Notice how the sample size can exceed the original dataset size.
PySpark DataFrame | sampleBy method
PySpark DataFrame's sampleBy(~) method performs stratified sampling based on a column. Consult
examples below for clarification.
Parameters
1. col | Column or string
The column by which to perform sampling.
2. fractions | dict
The probability with which to include the value. Consult examples below for clarification.
3. seed | int | optional
Using the same value for seed produces the exact same results every time. By default, no seed will
be set, which means that the outcome will be different every time you run the method.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
vals = ['a','a','a','a','a','a','b','b','b','b']
df = spark.createDataFrame([[v] for v in vals], ["value"])
df.show(3)
+-----+
|value|
+-----+
| a|
| a|
| a|
+-----+
To sample rows based on the value column:
df.sampleBy('value', fractions={'a': 0.5, 'b': 0.25}).show()
+-----+
|value|
+-----+
| a|
| a|
| a|
| b|
| b|
+-----+
Here, rows with value 'a' will be included in our sample with a probability of 0.5 , while rows with
value 'b' will be included with a probability of 0.25 .
WARNING
The number of samples that will be included will be different each time. For instance,
specifying {'a':0.5} does not mean that half the rows with the value 'a' will be included - instead it
means that each row will be included with a probability of 0.5 . This means that there may be cases
when all rows with value 'a' will end up in the final sample.
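A plain-Python sketch of this stratified sampling: each row is kept with the probability assigned to its own value:

```python
import random

# Plain-Python sketch of sampleBy('value', fractions={'a': 0.5, 'b': 0.25}):
# each row is kept with the probability assigned to its own value.
vals = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
fractions = {'a': 0.5, 'b': 0.25}
rng = random.Random(1)

sample = [v for v in vals if rng.random() < fractions[v]]
print(sample)  # a mix of 'a's and 'b's; the exact size is random
```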
RELATED
PySpark DataFrame's sample(~) method returns a random subset of rows of the DataFrame.
PySpark DataFrame | select method
The select(~) method of PySpark DataFrame returns a new DataFrame with the specified columns.
Parameters
1. *cols | string , Column or list
The columns to include in the returned DataFrame.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.select("name").show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
df.select(df["name"]).show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, df["name"] is of type Column . Here, you can think of the role of select(~) as converting
a Column object into a PySpark DataFrame.
import pyspark.sql.functions as F
df.select(F.col ("name")).show ()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
df.select("name","age").show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pyspark.sql.functions as F
df.select(F.col("name"), F.col("age")).show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.select("*").show ()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Selecting columns given a list of column labels
cols = ["name", "age"]
df.select(*cols).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the * operator is used to convert the list into positional arguments.
cols = [col for col in df.columns if col.startswith("na")]
df.select(*cols).show()
+----+
|name|
+----+
|Alex|
| Bob|
+----+
Here, we are using Python's list comprehension to get a list of column labels that begin with the
substring "na" :
cols
['name']
PySpark DataFrame | selectExpr method
schedule AUG 12, 2023
PySpark DataFrame's selectExpr(~) method returns a new DataFrame based on the specified SQL
expression.
Parameters
1. *expr | string
The SQL expressions to compute.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Selecting data using SQL expressions in PySpark DataFrame
To get a new DataFrame where the values for the name column is uppercased:
df.selectExpr("upper(name) AS upper_name", "age * 2").show()
+----------+---------+
|upper_name|(age * 2)|
+----------+---------+
| ALEX| 40|
| BOB| 60|
| CATHY| 80|
+----------+---------+
We should use selectExpr(~) rather than select(~) to extract columns while performing some simple
transformations on them - just as we have done here.
NOTE
There exists a similar method expr(~) in the pyspark.sql.functions library. expr(~) also takes in as
argument a SQL expression, but the difference is that the return type is a PySpark Column . The
following usage of selectExpr(~) and expr(~) are equivalent:
df.selectExpr("upper(name)").show()
+-----------+
|upper(name)|
+-----------+
| ALEX|
| BOB|
| CATHY|
+-----------+
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 60]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 60|
+----+---+
We can use classic SQL clauses like AND and LIKE to formulate more complicated expressions:
df.selectExpr("age < 30 AND name LIKE 'A%' AS result").show()
+------+
|result|
+------+
| true|
| false|
+------+
Here, we are checking for rows where age is less than 30 and the name starts with the letter A .
The equivalent using expr(~) :
import pyspark.sql.functions as F
col = F.expr("age < 30 AND name LIKE 'A%' AS result")
df.select(col).show()
+------+
|result|
+------+
| true|
| false|
+------+
I personally prefer using selectExpr(~) because the syntax is cleaner and the meaning is intuitive for
those who are familiar with SQL.
Another application of selectExpr(~) is to check for the existence of values in a PySpark column.
Please check out the recipe here .
PySpark DataFrame | show method
PySpark DataFrame's show(~) method prints the rows of the DataFrame on the console.
Parameters
1. n | int | optional
The number of rows to show. By default, n=20 .
2. truncate | boolean or int | optional
If True , then strings that are longer than 20 characters will be truncated.
If int , then strings that are longer than truncate will be truncated.
If truncation occurs, then the left part of the string is preserved. By default, truncate=True .
3. vertical | boolean | optional
If True , then the rows are printed with one line for each column value. By default, vertical=False .
Return Value
None .
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()  # n=20 by default
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
df.show (n=2)
+----+---+
|name|age|
+----+---+
|Alex| 15|
| Bob| 20|
+----+---+
only showing top 2 rows
df.show(truncate=2)
+----+---+
|name|age|
+----+---+
| Al| 15|
| Bo| 20|
| Ca| 25|
+----+---+
df.show(truncate=False)
+-----+---+
|name |age|
+-----+---+
|Alex |15 |
|Bob |20 |
|Cathy|25 |
+-----+---+
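The truncation rule can be sketched in plain Python. The exact ellipsis behaviour here is an assumption modelled on the outputs above: very small widths get a hard cut, larger widths get a trailing "...":

```python
# Sketch of show(~)'s cell truncation: the left part of the string is
# kept. The exact ellipsis rule is an assumption: widths below 4 get a
# hard cut, larger widths get "..." appended after the cut.
def truncate_cell(s: str, truncate: int = 20) -> str:
    if len(s) <= truncate:
        return s
    return s[:truncate] if truncate < 4 else s[:truncate - 3] + "..."

print(truncate_cell("Cathy", truncate=2))  # 'Ca' - matches the output above
print(truncate_cell("a" * 30))             # 17 chars followed by '...'
```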
df.show(vertical=True)
-RECORD 0-----
name | Alex
age | 15
-RECORD 1-----
name | Bob
age | 20
-RECORD 2-----
name | Cathy
age | 25
PySpark DataFrame | sort method
PySpark DataFrame's sort(~) method returns a new DataFrame with the rows sorted based on the
specified columns.
Parameters
1. cols | string or list or Column
The columns by which to sort.
2. ascending | boolean or list of boolean | optional
Whether to sort in ascending order. By default, ascending=True .
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 30], ["Bob", 20], ["Cathy", 20]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 30|
| Bob| 20|
|Cathy| 20|
+-----+---+
df.sort("age").show () # ascending=True
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
import pyspark.sql.functions as F
df.sort(F.col("age")).show ()
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
df.sort("age", ascending=False).show ()
+-----+---+
| name|age|
+-----+---+
| Alex| 30|
| Bob| 20|
|Cathy| 20|
+-----+---+
To sort a PySpark DataFrame by the age column first, and then by the name column both in
ascending order:
df.sort(["age", "name"]).show ()
+-----+---+
| name|age|
+-----+---+
| Bob| 20|
|Cathy| 20|
| Alex| 30|
+-----+---+
Here, Bob and Cathy appear before Alex because their age ( 20 ) is smaller. Bob then comes
before Cathy because B comes before C .
We can also pass a list of booleans to specify the desired ordering of each column:
df.sort(["age", "name"], ascending=[True, False]).show()
+-----+---+
| name|age|
+-----+---+
|Cathy| 20|
| Bob| 20|
| Alex| 30|
+-----+---+
Here, we are first sorting by age in ascending order, and then by name in descending order.
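A plain-Python sketch of this mixed-order sort, using the classic trick of applying a stable sort from the last key to the first:

```python
# Plain-Python sketch of sorting by age ascending, then name descending,
# mirroring sort(["age", "name"], ascending=[True, False]).
rows = [("Alex", 30), ("Bob", 20), ("Cathy", 20)]

# Sort by the secondary key first (name, descending); a stable sort by
# the primary key (age, ascending) then preserves that order within ties.
rows.sort(key=lambda r: r[0], reverse=True)
rows.sort(key=lambda r: r[1])

print(rows)  # [('Cathy', 20), ('Bob', 20), ('Alex', 30)]
```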
PySpark DataFrame | summary method
PySpark DataFrame's summary(~) method returns a PySpark DataFrame containing basic summary
statistics of numeric columns.
Parameters
1. *statistics | string | optional
The statistics to compute, which can be:
count
mean
stddev
min
max
arbitrary approximate percentiles (e.g. "60%" )
By default, all the above as well as the 25%, 50%, and 75% percentiles are computed.
Return Value
PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 30]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Doge| 30|
+-----+---+
To get the summary statistics of the DataFrame:
df.summary().show()
+-------+----+-----------------+
|summary|name| age|
+-------+----+-----------------+
| count| 4| 4|
| mean|null| 24.0|
| stddev|null|4.320493798938574|
| min|Alex| 20|
| 25%|null| 20|
| 50%|null| 22|
| 75%|null| 24|
| max|Doge| 30|
+-------+----+-----------------+
To compute only certain statistics, pass them in as arguments:
df.summary("max", "min").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
| max|Doge| 30|
| min|Alex| 20|
+-------+----+---+
Arbitrary percentiles can also be computed:
df.summary("60%").show()
+-------+----+---+
|summary|name|age|
+-------+----+---+
| 60%|null| 24|
+-------+----+---+
To summarise certain columns instead, use the select(~) method first to select the columns that you
want to summarize:
df.select("age").summary("max", "min").show()
+-------+---+
|summary|age|
+-------+---+
| max| 30|
| min| 20|
+-------+---+
RELATED
PySpark DataFrame's describe(~) method returns a new PySpark DataFrame holding summary statistics
of the specified columns.
PySpark DataFrame's take(~) method returns the first num number of rows as a list of Row objects.
Parameters
1. num | integer
The number of rows to return.
Return Value
A list of Row objects.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 40|
+-----+---+
To get the first n rows of a PySpark DataFrame as a list of Row objects:
df.take(2)
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
The difference between take(~) and head(~) is that take(~) always returns a list of Row objects,
whereas head(~) returns a single Row object when called with no argument (head()).
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
df.take(1)
[Row(name='Alex', age=20)]
df.head(1)
[Row(name='Alex', age=20)]
When n is explicitly specified, the methods take(~) and head(~) yield the same output.
PySpark DataFrame's toDF(~) method returns a new DataFrame with the columns relabeled using the
names you specify, assigned positionally.
WARNING
This method only relabels the columns - the data itself is not reordered, and you must supply
exactly as many labels as there are columns.
Parameters
1. *cols | str
The new column labels.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To relabel the columns as age and name:
df.toDF("age", "name").show()
+----+----+
| age|name|
+----+----+
|Alex| 20|
| Bob| 30|
+----+----+
Note that if the number of labels does not match the number of columns, then an
error will be thrown:
df.toDF("age").show()
Here, an IllegalArgumentException is raised because two labels are required but only one was supplied.
PySpark DataFrame's toJSON(~) method converts the DataFrame into a string-typed RDD. When
the RDD data is extracted, each row of the DataFrame will be converted into a JSON string.
Consult the examples below for clarification.
Parameters
1. use_unicode | boolean | optional
Whether to use unicode during the conversion. By default, use_unicode=True.
Return Value
A MapPartitionsRDD object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["André", 20], ["Bob", 30], ["Cathy", 30]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
|André| 20|
| Bob| 30|
|Cathy| 30|
+-----+---+
df.toJSON().first()
'{"name":"André","age":20}'
To parse a JSON string into a dictionary, use the json module:
import json
json.loads(df.toJSON().first())
{'name': 'André', 'age': 20}
df.toJSON().collect()
['{"name":"André","age":20}',
'{"name":"Bob","age":30}',
'{"name":"Cathy","age":30}']
df.toJSON().map(lambda json_str: json.loads(json_str)).collect()
[{'name': 'André', 'age': 20}, {'name': 'Bob', 'age': 30}, {'name': 'Cathy', 'age': 30}]
Here, we are using the RDD.map(~) method to apply a custom function (json.loads) on each element of the
RDD.
df.toJSON().first()  # use_unicode=True
'{"name":"André","age":20}'
df.toJSON(use_unicode=False).first()
b'{"name":"Andr\xc3\xa9","age":20}'
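Independently of Spark, the strings that toJSON() produces are ordinary JSON, and the use_unicode=False variant is simply their UTF-8 encoding. A pure-Python sketch of that relationship (the sample string mirrors the first row above):

```python
import json

s = '{"name":"André","age":20}'   # shape of df.toJSON().first()
b = s.encode("utf-8")             # shape of the use_unicode=False output

# The bytes are the UTF-8 encoding of the same JSON text.
assert b == b'{"name":"Andr\xc3\xa9","age":20}'

# Both forms parse to the same dictionary.
row = json.loads(s)
assert json.loads(b) == row
print(row)  # {'name': 'André', 'age': 20}
```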
PySpark DataFrame's toPandas(~) method converts a PySpark DataFrame into a Pandas DataFrame.
WARNING
All the data from the worker nodes are transferred to the Driver, and so make sure that
your Driver has sufficient memory.
Return Value
A Pandas DataFrame.
Examples
Consider the following DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
To convert our PySpark DataFrame into a Pandas DataFrame, use the toPandas() method:
df.toPandas()
    name  age
0   Alex   20
1    Bob   24
2  Cathy   22
PySpark DataFrame's transform(~) method applies the supplied function to the DataFrame and
returns the resulting new PySpark DataFrame.
Parameters
1. func | function
The function to apply. It must take in a PySpark DataFrame and return a PySpark DataFrame.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To get a new PySpark DataFrame where the columns are sorted in ascending order:
def sort_columns(df_input):
    return df_input.select(*sorted(df_input.columns))

df.transform(sort_columns).show()
+---+----+
|age|name|
+---+----+
| 25|Alex|
| 30| Bob|
+---+----+
Here, the * converts the list of column labels into positional arguments of the select(~) method.
PySpark DataFrame's union(~) method concatenates two DataFrames vertically based on column
positions.
WARNING
The DataFrames will be vertically concatenated based on the column position rather than
the labels. See examples below for clarification.
Parameters
1. other | PySpark DataFrame
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Concatenating PySpark DataFrames vertically based on column position
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 25], ["Doge", 30], ["Eric", 50]], ["name", "age"])
df2.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
|Doge| 30|
|Eric| 50|
+----+---+
df1.union(df2).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Alex| 25|
| Doge| 30|
| Eric| 50|
+-----+---+
Next, consider two DataFrames whose column labels differ:
df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
df2 = spark.createDataFrame([["Alex", 250], ["Doge", 200], ["Eric", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
|Doge| 200|
|Eric| 100|
+----+------+
df1.union(df2).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
| Alex|250|
| Doge|200|
| Eric|100|
+-----+---+
Notice how even though the two DataFrames had separate column labels, the method still
concatenated them. This is because the concatenation is based on the column positions and so the
labels play no role here. You should be wary of this behaviour because the union(~) method may
yield incorrect DataFrames like the one above without throwing an error!
PySpark DataFrame's unionByName(~) method concatenates two DataFrames vertically by aligning
their columns based on column labels.
Parameters
1. other | PySpark DataFrame
The other DataFrame to concatenate.
2. allowMissingColumns | boolean | optional
If True, then no error will be thrown if the column labels of the two DataFrames do not
align. In case of misalignment, null values will be filled in.
If False, then an error will be thrown if the column labels of the two DataFrames do not
align.
By default, allowMissingColumns=False.
Return Value
A new PySpark DataFrame .
Examples
Concatenating PySpark DataFrames vertically by aligning columns
df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
+---+---+---+
df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["A", "B", "C"])
df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
To concatenate these two DataFrames vertically by aligning the columns:
filter_none
df1.unionByName(df2).show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
By default, allowMissingColumns=False , which means that if the two DataFrames do not have exactly
matching column labels, then an error will be thrown.
Consider the following PySpark DataFrame:
df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Here's the other PySpark DataFrame that has slightly different column labels:
df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["B", "C", "D"])
df2.show()
+---+---+---+
| B| C| D|
+---+---+---+
| 4| 5| 6|
| 7| 8| 9|
+---+---+---+
Since the column labels do not match, calling unionByName(~) will result in an error:
df1.unionByName(df2).show()  # allowMissingColumns=False
To allow for missing columns, set allowMissingColumns=True:
df1.unionByName(df2, allowMissingColumns=True).show()
+----+---+---+----+
| A| B| C| D|
+----+---+---+----+
| 1| 2| 3|null|
|null| 4| 5| 6|
|null| 7| 8| 9|
+----+---+---+----+
PySpark DataFrame's where(~) method returns the rows of the DataFrame that satisfy the given
condition.
NOTE
The where(~) method is an alias for the filter(~) method.
Parameters
1. condition | Column or string
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Basic usage
To get rows where age is greater than 25:
df.where("age > 25").show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass a Column object that represents a boolean mask:
df.where(df.age > 25).show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can use the col(~) function of sql.functions to refer to the column:
import pyspark.sql.functions as F
df.where(F.col("age") > 25).show()
+-----+---+
| name|age|
+-----+---+
| Bob| 30|
|Cathy| 40|
+-----+---+
Compound queries
The where(~) method supports the AND and OR statements like so:
df.where((F.col("age") > 25) & (F.col("name") == "Bob")).show()
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
Dealing with null values
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], [None, None], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
df.where("age != 10").show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Notice how only Alex's row is returned even though the other two rows technically have age!=10.
This happens because PySpark's where(~) method filters out null values by default.
To prevent rows with null values from getting filtered out, we can perform the query like so:
df.where((F.col("age") != 10) | (F.col("age").isNull())).show()
+-----+----+
| name| age|
+-----+----+
| Alex| 20|
| null|null|
|Cathy|null|
+-----+----+
Note that PySpark's treatment of null values is different compared to Pandas because Pandas will
retain rows with missing values, as demonstrated below:
import pandas as pd
df = pd.DataFrame({
    "col": ["a", "b", None]
})
df[df["col"] != "a"]
col
1 b
2 None
PySpark DataFrame's withColumn(~) method returns a new DataFrame with a new or updated column.
Parameters
1. colName | string
The label of the new column. If colName already exists, then the supplied col will update the existing
column. If colName does not exist, then col will be added as a new column.
2. col | Column
The Column object holding the values of the new or updated column.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 50]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
| Bob| 30|
|Cathy| 50|
+-----+---+
To update an existing column, supply its column label as the first argument:
filter_none
df.withColumn("age", 2 * df.age).show()
+-----+---+
| name|age|
+-----+---+
| Alex| 50|
| Bob| 60|
|Cathy|100|
+-----+---+
Note that you must pass in a Column object as the second argument, and so you cannot simply use
a list as the new column values.
To add a new column of constant values, use the lit(~) method:
import pyspark.sql.functions as F
df.withColumn("AGEE", F.lit(0)).show()
+-----+---+----+
| name|age|AGEE|
+-----+---+----+
| Alex| 25| 0|
| Bob| 30| 0|
|Cathy| 50| 0|
+-----+---+----+
Here, F.lit(0) returns a Column object holding 0s. Note that since column labels are case-insensitive,
if you pass in "AGE" as the first argument, you would end up overwriting the age column.
PySpark DataFrame's withColumnRenamed(~) method is used to replace column labels. If the column
label that you want to replace does not exist, no error will be thrown.
Parameters
1. existing | string
The label of the column to rename.
2. new | string
The new column label.
Return Value
A PySpark DataFrame ( pyspark.sql.dataframe.DataFrame ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To rename the age column to AGE:
df.withColumnRenamed("age", "AGE").show()
+----+---+
|name|AGE|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Note that no error will be thrown if the column label you want to replace does not exist:
df.withColumnRenamed("ageeee", "AGE").show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To replace multiple column labels at once, we can chain the withColumnRenamed(~) method like so:
df.withColumnRenamed("name", "NAME").withColumnRenamed("age", "AGE").show()
+----+---+
|NAME|AGE|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
PySpark DataFrame's columns property returns the column labels as a list.
Return Value
A standard list of strings.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To get the column labels, use the columns property:
df.columns
['name', 'age']
PySpark DataFrame's dtypes property returns the column labels and types as a list of tuples.
Return Value
List of tuples.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 15], ["Bob", 20], ["Cathy", 25]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 15|
| Bob| 20|
|Cathy| 25|
+-----+---+
To obtain the column labels and types, use the dtypes property:
df.dtypes
[('name', 'string'), ('age', 'bigint')]
PySpark DataFrame's rdd property returns the RDD representation of the DataFrame. Keep in
mind that PySpark DataFrames are internally represented as RDDs.
Return Value
RDD containing Row objects.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To convert our PySpark DataFrame into an RDD, use the rdd property:
rdd = df.rdd
rdd.collect()
[Row(name='Alex', age=25), Row(name='Bob', age=30)]
Here, we are using the collect() method to see the content of our RDD, which is a list
of Row objects.
PySpark's SQL Row asDict(~) method converts a Row object into a dictionary.
Parameters
1. recursive | boolean | optional
If True , then nested Row objects will be converted into dictionary as well.
By default, recursive=False .
Return Value
A dictionary.
Examples
Converting a PySpark Row object into a dictionary
from pyspark.sql import Row
row = Row(name='alex', age=25)
row
Row(name='alex', age=25)
row.asDict()
{'name': 'alex', 'age': 25}
By default, recursive=False , which means that nested rows will not be converted into dictionaries:
row = Row(name='Alex', age=25, friends=Row(name='Bob', age=30))
row.asDict()  # recursive=False
{'name': 'Alex', 'age': 25, 'friends': Row(name='Bob', age=30)}
To convert nested Row objects into dictionaries as well, set recursive=True like so:
row.asDict(True)
{'name': 'Alex', 'age': 25, 'friends': {'name': 'Bob', 'age': 30}}
PySpark Column's alias(~) method assigns a label to a Column object and returns the relabeled Column.
Parameters
1. *alias | string
The label(s) to assign.
2. metadata | dict | optional
A dictionary holding additional meta-information to store in the StructField of the returned Column.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["ALEX", 20], ["BOB", 30], ["CATHY", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 20|
| BOB| 30|
|CATHY| 40|
+-----+---+
Most methods in the PySpark SQL Functions library return Column objects whose label is governed by
the method that we use. For instance, consider the lower(~) method:
import pyspark.sql.functions as F
df.select(F.lower(df.name)).show()
+-----------+
|lower(name)|
+-----------+
| alex|
| bob|
| cathy|
+-----------+
Here, the PySpark Column returned by lower(~) has the label lower(name) by default.
To assign our own label instead, use the alias(~) method:
df.select(F.lower(df.name).alias("lower_name")).show()
+----------+
|lower_name|
+----------+
| alex|
| bob|
| cathy|
+----------+
Here, we have assigned the label "lower_name" to the column returned by lower(~) .
To store some meta-data in a PySpark Column, we can add the metadata option in alias(~):
df_new = df.select(F.lower(df.name).alias("lower_name", metadata={"some_data": 10}))
df_new.show()
+----------+
|lower_name|
+----------+
| alex|
| bob|
| cathy|
+----------+
To access the metadata, we can use the PySpark DataFrame's schema property:
df_new.schema["lower_name"].metadata["some_data"]
10
PySpark Column's cast(~) method returns a new Column of the specified type.
Parameters
1. dataType | Type or string
The type to convert the column into.
Return Value
A new Column object.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To convert the type of the DataFrame's age column from numeric to string :
df_new = df.withColumn("age", df["age"].cast("string"))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass in a type object from pyspark.sql.types:
from pyspark.sql.types import StringType
df_new = df.withColumn("age", df["age"].cast(StringType()))
df_new.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To confirm that the column type has been converted to string, use the printSchema() method:
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
To convert the PySpark column type to date, use the to_date(~) method instead of cast(~) .
PySpark Column | contains method
schedule AUG 12, 2023
PySpark Column's contains(~) method returns a Column object of booleans where True corresponds
to column values that contain the specified substring.
Parameters
1. other | string or Column
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
Getting rows that contain a substring in PySpark DataFrame
To get rows where name contains the substring "le":
import pyspark.sql.functions as F
df.filter(F.col("name").contains("le")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, F.col("name").contains("le") returns a Column object holding booleans where True corresponds to
strings that contain the substring "le" :
df.select(F.col("name").contains("le")).show()
+------------------+
|contains(name, le)|
+------------------+
| true|
| false|
| false|
+------------------+
In our solution, we use the filter(~) method to extract rows that correspond to True .
PySpark Column's dropFields(~) method returns a new PySpark Column object with the specified
nested fields removed.
Parameters
1. *fieldNames | string
Return Value
A PySpark Column.
Examples
Consider the following PySpark DataFrame with some nested Rows:
from pyspark.sql import Row
# The friend's field values below are illustrative - the original values were not preserved.
data = [
    Row(name="Cathy", age=40, friend=Row(name="Doge", age=30, height=180))
]
df = spark.createDataFrame(data)
df.show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
|Cathy| 40|{Doge, 30, 180}|
+-----+---+---------------+
df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- height: long (nullable = true)
To remove the age and height fields under friend , use the dropFields(~) method:
df_new = df.withColumn("friend", df["friend"].dropFields("age", "height"))
df_new.show()
+-----+---+------+
| name|age|friend|
+-----+---+------+
|Cathy| 40|{Doge}|
+-----+---+------+
Here, we are using the withColumn(~) method to update the friend column with the new column
returned by dropFields(~).
df_new.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friend: struct (nullable = true)
 |    |-- name: string (nullable = true)
NOTE
Even if the nested field you wish to delete does not exist, no error will be thrown:
updated_col = df["friend"].dropFields("ZZZZZZZZZ")
df_new = df.withColumn("friend", updated_col)
df_new.show()
+-----+---+---------------+
| name|age|         friend|
+-----+---+---------------+
|Cathy| 40|{Doge, 30, 180}|
+-----+---+---------------+
Here, the nested field "ZZZZZZZZZ" obviously does not exist but no error was thrown.
PySpark Column's endswith(~) method returns a column of booleans where True is given to strings
that end with the specified substring.
Parameters
1. other | string or Column
Return Value
A Column object holding booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
To get rows where name ends with the substring "x":
import pyspark.sql.functions as F
df.filter(F.col("name").endswith("x")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, F.col("name").endswith("x") returns a Column object holding booleans:
df.select(F.col("name").endswith("x")).show()
+-----------------+
|endswith(name, x)|
+-----------------+
| true|
| false|
| false|
+-----------------+
We then use the PySpark DataFrame's filter(~) method to fetch rows that correspond to True .
PySpark Column's getItem(~) method extracts a value from the lists or dictionaries in a PySpark
Column.
Parameters
1. key | any
for lists, key should be an integer index indicating the position of the value that you wish
to extract.
for dictionaries, key should be the key of the values you wish to extract.
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[[5, 6]], [[7, 8]]], ["vals"])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in the vals column:
df_result = df.select(df.vals.getItem(1).alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| 6|
| 8|
+-------+
Equivalently, we can use the [~] syntax instead of getItem(~):
df_result = df.select(df.vals[1].alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| 6|
| 8|
+-------+
Specifying an index position that is out of bounds for the list will return a null value:
df_result = df.select(df.vals.getItem(2).alias("2nd val"))
df_result.show()
+-------+
|2nd val|
+-------+
| null|
| null|
+-------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing maps:
df = spark.createDataFrame([[{"A": 4}], [{"A": 5, "B": 6}]], ["vals"])
df.show()
+----------------+
|            vals|
+----------------+
|        {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that have the key 'A':
df_result = df.select(df.vals.getItem("A"))
df_result.show()
+-------+
|vals[A]|
+-------+
| 4|
| 5|
+-------+
Note that referring to keys that do not exist will return null :
df_result = df.select(df.vals.getItem("C"))
df_result.show()
+-------+
|vals[C]|
+-------+
| null|
| null|
+-------+
RELATED
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in a PySpark
Column.
PySpark SQL Functions' element_at(~) method is used to extract values from lists or maps in a
PySpark Column.
Parameters
1. col | string or Column
2. extraction | int
The position of the value that you wish to extract. Negative positioning is supported -
extraction=-1 will extract the last element from each list.
WARNING
The position is not zero-based, unlike ordinary Python indexing. This means that extraction=1 will
extract the first value in the lists or maps.
Return Value
A new PySpark Column.
Examples
Extracting n-th value from arrays in PySpark Column
Consider the following PySpark DataFrame:
df = spark.createDataFrame([[[5, 6]], [[7, 8]]], ["vals"])
df.show()
+------+
| vals|
+------+
|[5, 6]|
|[7, 8]|
+------+
To extract the second value from each list in vals, we can use element_at(~) like so:
import pyspark.sql.functions as F
df_res = df.select(F.element_at("vals", 2).alias("2nd value"))
df_res.show()
+---------+
|2nd value|
+---------+
| 6|
| 8|
+---------+
Here, we are using the alias(~) method to assign a label to the column returned by element_at(~).
Note that extracting values that are out of bounds will return null :
df_res = df.select(F.element_at("vals", 3))
df_res.show()
+-------------------+
|element_at(vals, 3)|
+-------------------+
| null|
| null|
+-------------------+
We can also extract the last element by supplying a negative value for extraction :
df_res = df.select(F.element_at("vals", -1).alias("last value"))
df_res.show()
+----------+
|last value|
+----------+
| 6|
| 8|
+----------+
Extracting values from maps in PySpark Column
Consider the following PySpark DataFrame containing maps:
df = spark.createDataFrame([[{"A": 4}], [{"A": 5, "B": 6}]], ["vals"])
df.show()
+----------------+
|            vals|
+----------------+
|        {A -> 4}|
|{A -> 5, B -> 6}|
+----------------+
To extract the values that has the key 'A' in the vals column:
df_res = df.select(F.element_at(df["vals"], "A"))
df_res.show()
+-------------------+
|element_at(vals, A)|
+-------------------+
| 4|
| 5|
+-------------------+
Note that extracting values using keys that do not exist will return null :
df_res = df.select(F.element_at(df["vals"], "B"))
df_res.show()
+-------------------+
|element_at(vals, B)|
+-------------------+
| null|
| 6|
+-------------------+
Here, the key 'B' does not exist in the map {'A':4} so a null was returned for that row.
RELATED
PySpark Column's getItem(~) method extracts a value from the lists or dictionaries in a PySpark
Column.
PySpark Column's isNotNull() method identifies rows where the value is not null.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 25|
| Bob| 30|
|Cathy|null|
+-----+----+
To get a boolean Column indicating which rows have a non-null age:
df.select(df.age.isNotNull()).show()
+-----------------+
|(age IS NOT NULL)|
+-----------------+
|             true|
|             true|
|            false|
+-----------------+
To fetch the rows where age is not null:
df.filter(df.age.isNotNull()).show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, the filter(~) extracts rows that correspond to True in the boolean column returned
by isNotNull() method.
PySpark Column | isin method
schedule AUG 12, 2023
PySpark Column's isin(~) method returns a Column object of booleans where True corresponds to
column values that are included in the specified list of values.
Parameters
1. *cols | any type
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Getting rows where values are contained in a list of values in PySpark DataFrame
To get rows where the value of the name column is either "Cathy" or "Alex":
df.filter(df.name.isin("Cathy", "Alex")).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, df.name.isin("Cathy", "Alex") returns a Column object holding booleans:
df.select(df.name.isin("Cathy", "Alex")).show()
+-----------------------+
|(name IN (Cathy, Alex))|
+-----------------------+
|                   true|
|                  false|
+-----------------------+
Note that if you have a list of values instead, use the * operator to convert the list into positional
arguments:
df.filter(df.name.isin(*["Cathy", "Alex"])).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
PySpark Column's isNull() method identifies rows where the value is null.
Return Value
A PySpark Column ( pyspark.sql.column.Column ).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex| 25|
| Bob| 30|
|Cathy|null|
+-----+----+
To get a boolean Column indicating which rows have a null age:
df.select(df.age.isNull()).show()
+-------------+
|(age IS NULL)|
+-------------+
|        false|
|        false|
|         true|
+-------------+
df.where(df.age.isNull()).show()
+-----+----+
| name| age|
+-----+----+
|Cathy|null|
+-----+----+
Here, the where(~) method fetches rows that correspond to True in the boolean column returned by
the isNull() method.
One common mistake is to use equality to compare null values. For example, consider the
following DataFrame:
df = spark.createDataFrame([["Alex", 25.0], ["Bob", 30.0], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob|30.0|
|Cathy|null|
+-----+----+
df.where(df.age == None).show()
+----+---+
|name|age|
+----+---+
+----+---+
Notice how Cathy's row where the age is null is not picked up. When comparing null values, we
should always use isNull() instead.
Consider the following PySpark DataFrame that contains both NaN and null values:
import numpy as np
df = spark.createDataFrame([["Alex", 25.0], ["Bob", np.nan], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|25.0|
| Bob| NaN|
|Cathy|null|
+-----+----+
Here, the age column contains both NaN and null . In PySpark, NaN and null are treated as different
entities as demonstrated below:
df.where(F.col("age").isNull()).show()
+-----+----+
| name| age|
+-----+----+
|Cathy|null|
+-----+----+
Here, notice how Bob's row whose age is NaN is not picked up. To get rows with NaN, use
the isnan(~) method like so:
df.where(F.isnan("age")).show()
+----+---+
|name|age|
+----+---+
| Bob|NaN|
+----+---+
PySpark Column's otherwise(~) method is used after a when(~) method to implement an if-else logic.
Click here for our documentation on when(~) method.
Parameters
1. value
The value to assign if the conditions set by when(~) are not satisfied.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Basic usage
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+
To replace the name Alex with Doge, and others with Eric:
import pyspark.sql.functions as F
+-----------------------------------------------+
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+
Note that we can replace our existing column with the new column like so:
df.show()
+----+---+
|name|age|
+----+---+
|Doge| 20|
|Eric| 24|
|Eric| 22|
+----+---+
PySpark Column's rlike(~) method returns a Column of booleans where True corresponds to string
column values that match the specified regular expression.
Return Value
A Column object of booleans.
Examples
Consider the following PySpark DataFrame:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
Getting rows where values match some regular expression in PySpark DataFrame
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Here, the regular expression "^A" matches strings that begin with "A".
Also, F.col("name").rlike("^A") returns a Column object of booleans:
+---------------+
|RLIKE(name, ^A)|
+---------------+
| true|
| false|
+---------------+
In our solution, we use the filter(~) method to fetch only the rows that correspond to True.
Published by Isshin Inada
PySpark Column | startswith method
AUG 12, 2023
PySpark Column's startswith(~) method returns a column of booleans where True is given to strings
that begin with the specified substring.
Parameters
1. other | string or Column
The substring (or Column) that values should begin with.
Return Value
A Column object holding booleans.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
+-------------------+
|startswith(name, A)|
+-------------------+
| true|
| false|
| false|
+-------------------+
We then use the PySpark DataFrame's filter(~) method to fetch rows that correspond to True.
PySpark Column's substr(~) method returns a Column of substrings extracted from string column
values.
Parameters
1. startPos | int or Column
The starting position. This position is inclusive and 1-based, meaning the first character is at
position 1. A negative position is allowed here as well - please consult the example below for
clarification.
Return Value
A Column object.
Examples
Consider the following PySpark DataFrame:
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 30|
|Cathy| 40|
+-----+---+
+----------+
|short_name|
+----------+
| lex|
| ob|
| ath|
+----------+
Here, F.col("name").substr(2,3) means that we are extracting a substring starting from the 2nd
character and up to a length of 3. Even if the string is too short (e.g. "Bob"), no error will be
thrown.
Note that you could also specify a negative starting position like so:
+----------+
|short_name|
+----------+
| le|
| Bo|
| th|
+----------+
Here, we are starting from the third character from the end (inclusive).
PySpark Column's withField(~) method is used to either add or update a nested field value.
Parameters
1. fieldName | string
The name of the nested field to add or update.
2. col | Column
The new value of the nested field.
Return Value
A PySpark Column (pyspark.sql.column.Column).
Examples
Consider the following PySpark DataFrame with nested rows:
df = spark.createDataFrame(data)
df.show()
+-----+---+----------+
| name|age| friend|
+-----+---+----------+
+-----+---+----------+
Here, the friend column contains nested Row , which can be confirmed by printing out the schema:
df.printSchema()
root
import pyspark.sql.functions as F
+-----+---+---------+
| name|age| friend|
+-----+---+---------+
+-----+---+---------+
Note the following:
we are updating the name field of the friend column with the constant string "BOB".
F.lit("BOB") returns a Column object whose values are filled with the string "BOB".
the withColumn(~) method replaces the friend column of our DataFrame with the updated
column returned by withField(~).
Updating nested rows using original values in PySpark
To update nested rows using original values, use the withField(~) method like so:
+-----+---+----------+
| name|age| friend|
+-----+---+----------+
+-----+---+----------+
Here, we are uppercasing the name field of the friend column using F.upper("friend.name") , which
returns a Column object.
The withField(~) method can also be used to add new field values in nested rows:
df_new.show()
+-----+---+----------------+
| name|age| friend|
+-----+---+----------------+
df_new.printSchema()
root
We can see the new nested field upper_name has been added!
PySpark RDD's coalesce(~) method returns a new RDD with the number of partitions reduced.
Parameters
1. numPartitions | int
The number of partitions to reduce to.
2. shuffle | boolean | optional
Whether or not to shuffle the data such that they end up in different partitions. By
default, shuffle=False.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Consider the following RDD with 3 partitions:
rdd.glom().collect()
new_rdd = rdd.coalesce(numPartitions=2)
new_rdd.glom().collect()
We can see that the 2nd partition merged with the 3rd partition.
Instead of merging partitions to reduce the number of partitions, we can also shuffle the data:
new_rdd = rdd.coalesce(numPartitions=2, shuffle=True)
new_rdd.glom().collect()
As you can see, this results in a partitioning that is more balanced. The downside to shuffling,
however, is that this is a costly process when your data size is large since data must be transferred
from one worker node to another.
PySpark RDD's collect(~) method returns a list containing all the items in the RDD.
Parameters
This method does not take in any parameters.
Return Value
A Python standard list.
Examples
Converting a PySpark RDD into a list of values
rdd
rdd.getNumPartitions()
Depending on your configuration, these 8 partitions can reside across multiple machines (worker
nodes). The collect(~) method sends all the data of the RDD to the driver node and packs them into a
single list:
rdd.collect()
[4, 2, 5, 7]
WARNING
All the data from the worker nodes will be sent to the driver node, so make sure that you have
enough memory for the driver node - otherwise you'll end up with an OutOfMemory error!
PySpark RDD | collectAsMap method
AUG 12, 2023
PySpark RDD's collectAsMap(~) method collects all the elements of a pair RDD in the driver
node link and converts the RDD into a dictionary.
Return Value
A dictionary.
Examples
Consider the following PySpark pair RDD:
rdd.collect()
To convert a pair RDD into a dictionary in PySpark, use the collectAsMap() method:
rdd.collectAsMap()
WARNING
Since all the underlying data in the RDD is sent to the driver node, you may encounter
an OutOfMemoryError if the data is too large.
In case of duplicate keys
When we have duplicate keys, the latter key-value pair will overwrite the former ones:
rdd.collectAsMap()
{'a': 6, 'b': 2}
PySpark RDD's count(~) method returns the number of values in the RDD as an integer.
Parameters
This method does not take in any parameters.
Return Value
An integer (int).
Examples
Consider the following PySpark RDD:
rdd.collect()
To get the number of elements in the RDD, use the count() method:
rdd.count()
PySpark RDD's countByKey(~) method groups by the key of the elements in a pair RDD, and counts
each group.
Parameters
This method does not take in any parameters.
Return Value
A DefaultDict[key, int].
Examples
Consider the following PySpark pair RDD:
rdd.collect()
Here, the returned value is DefaultDict , which is basically a dictionary in which accessing values
that do not exist in the dictionary will return a 0 instead of throwing an error.
You can access the count of a key just as you would for an ordinary dictionary:
counts = rdd.countByKey()
counts["a"]
counts = rdd.countByKey()
counts["z"]
PySpark RDD's filter(~) method extracts a subset of the data based on the given function.
Parameters
1. f | function
A function that takes in as input an item of the RDD's data and returns a boolean where
True means the item is kept, and False means the item is dropped.
Examples
Consider the following RDD:
rdd
To obtain a new RDD where the values are all strictly larger than 3:
new_rdd.collect()
[4, 5, 7]
Here, the collect() method is used to retrieve the content of the RDD as a single list.
PySpark RDD's first(~) method returns the first element of the RDD.
Parameters
This method does not take in any parameters.
Return Value
The type will be that of the first element of the RDD.
Examples
We create a RDD using the parallelize(~) method:
rdd
To fetch the first element in the RDD, use the first() method:
rdd.first()
Return Value
An int.
Examples
Getting the number of partitions of RDD
rdd.getNumPartitions()
3
PySpark RDD | glom method
AUG 12, 2023
PySpark RDD's glom() method returns a RDD holding the content of each partition.
Parameters
This method does not take in any parameters.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following RDD:
rdd.collect()
rdd.glom().collect()
PySpark RDD's keys(~) method returns the keys of a pair RDD that contains tuples of length two.
Parameters
This method does not take in any parameters.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following PySpark pair RDD:
rdd.collect()
rdd.keys().collect()
Note that if the RDD is not a pair RDD, then the values are returned:
PySpark RDD's map(~) method applies a function on each element of the RDD.
Parameters
1. f | function
The function to apply to each element of the RDD.
2. preservesPartitioning | boolean | optional
Whether or not to let Spark assume that the partitioning is still valid. This is only relevant to pair RDDs.
Consult the examples below for clarification. By default, preservesPartitioning=False.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Applying a function to each element of RDD
new_rdd.collect()
The preservesPartitioning parameter only comes into play when the RDD contains a list of tuples
(pair RDD).
When a RDD is re-partitioned via partitionBy(~) (using a hash partitioner), we guarantee that the
tuples with the same key end up in the same partition:
new_rdd.glom().collect()
[[('C', 1)], [('A', 1), ('B', 1), ('A', 1), ('D', 1)]]
Indeed, we see that the tuple ('A',1) and ('A',1) lie in the same partition.
Let us now perform a map(~) operation with preservesPartitioning set to False (default):
mapped_rdd.glom().collect()
[[('C', 4)], [('A', 4), ('B', 4), ('A', 4), ('D', 4)]]
Here, we are applying a map(~) that returns a tuple with the same key, but with a different value.
We can see that the partitioning has not changed. Behind the scenes, however, Spark internally
keeps a flag that indicates whether or not the partitioning has been destroyed, and this flag has now
been set to True (i.e. partitioning has been destroyed) because preservesPartitioning is False by
default. This is naive of Spark, since the tuple keys have not been changed, and so the
partitioning should still be valid.
We can confirm that Spark is now naively unaware that the data is partitioned by the tuple key by
performing a shuffling operation like reduceByKey(~) :
print(mapped_rdd_reduced.toDebugString().decode("utf-8"))
You can see that a shuffling has indeed occurred. However, this is completely unnecessary
because we know that the tuples with the same key reside in the same partition (machine), and so
this operation can be done locally.
print(mapped_rdd_preserved_reduced.toDebugString().decode("utf-8"))
We can see that no shuffling has occurred. This is because we tell Spark that we have only
changed the value of the tuple, and not the key, and so Spark should assume that the original
partitioning is kept intact.
PySpark RDD's partitionBy(~) method re-partitions a pair RDD into the desired number of
partitions.
Parameters
1. numPartitions | int
The number of partitions to use.
2. partitionFunc | function | optional
The partitioning function - the input is the key and the return value must be the hashed value. By
default, a hash partitioner will be used.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Repartitioning a pair RDD
rdd.collect()
rdd.glom().collect()
new_rdd.glom().collect()
Notice how the tuples with the key A have ended up in the same partition. This is guaranteed to
happen because the hash partitioner performs bucketing based on the tuple key.
PySpark RDD's reduceByKey(~) method aggregates the RDD data by key, and performs a reduction
operation. A reduction operation is simply one where multiple values become reduced to a single
value (e.g. summation, multiplication).
Parameters
1. func | function
The reduction function to apply.
2. numPartitions | int | optional
By default, the number of partitions will be equal to the number of partitions of the parent RDD.
If the parent RDD does not have the partition count set, then the parallelism level in the PySpark
configuration will be used.
3. partitionFunc | function | optional
The partitioner to use - the input is a key and the return value must be the hashed value. By default, a
hash partitioner will be used.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following Pair RDD:
rdd.collect()
To group by key and perform a summation of the values of each grouped key:
rdd.reduceByKey(lambda a, b: a+b).collect()
By default, the number of partitions of the resulting RDD will be equal to the number of
partitions of the parent RDD:
new_rdd.getNumPartitions()
We can set the number of partitions of the resulting RDD by setting the numPartitions parameter:
new_rdd.getNumPartitions()
local_offer
PySpark
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!
PySpark RDD's repartition(~) method splits the RDD into the specified number of partitions.
NOTE
When we first create RDDs, they will already be partitioned under the hood, which means that all
RDDs are already partitioned. This method is called repartition(~) (emphasis on the re) because we
are changing the existing partitioning.
Parameters
1. numPartitions | int
The number of partitions to split the RDD into.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Re-partitioning a RDD with certain number of partitions
rdd.collect()
Here, we are using the parallelize(~) method to create a RDD with 3 partitions.
We can use the glom() method to see the actual content of the partitions:
rdd.glom().collect()
new_rdd = rdd.repartition(2)
new_rdd.glom().collect()
Note that the same values do not necessarily end up in the same partition ('A' can be found in both
partitions), and the number of elements in each partition may not be balanced - here we have 4
elements in the first partition, but only 2 elements in the second partition.
WARNING
The repartition(~) method involves shuffling, even when reducing the number of partitions. To
avoid shuffling when reducing the number of partitions, use RDD's coalesce(~) method instead.
PySpark RDD | zip method
AUG 12, 2023
PySpark RDD's zip(~) method combines the elements of two RDDs into a single RDD of tuples.
Parameters
1. other | RDD
The other RDD to combine with.
Return Value
A new PySpark RDD.
Examples
Combining two PySpark RDDs into a single RDD of tuples
Here, we are using the parallelize(~) method to create two RDDs, each having 3 partitions.
We can see the actual values in each partition using the glom(~) method:
x.glom().collect()
We see that RDD x indeed has 3 partitions, and we have 2 elements in each partition. The same
can be said for RDD y :
y.glom().collect()
[[10, 11], [12, 13], [14, 15]]
We can combine the two RDDs x and y into a single RDD of tuples using the zip(~) method:
zipped_rdd = x.zip(y)
zipped_rdd.collect()
[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15)]
WARNING
In order to use the zip(~) method, the two RDDs must have the exact same number of partitions as
well as the exact same number of elements in each partition.
PySpark RDD's zipWithIndex(~) method returns a RDD of tuples where the first element of the tuple
is the value and the second element is the index. The first value of the first partition will be given
an index of 0.
Parameters
This method does not take in any parameters.
Return Value
A new PySpark RDD.
Examples
Consider the following PySpark RDD with 2 partitions:
rdd = sc.parallelize(['A','B','C'], 2)
rdd.collect()
We can see the content of each partition using the glom() method:
rdd.glom().collect()
We see that we indeed have 2 partitions with the first partition containing the value 'A' , and the
second containing the values 'B' and 'C' .
We can create a new RDD of tuples containing positional index information using zipWithIndex(~) :
new_rdd = rdd.zipWithIndex()
new_rdd.collect()
We see that the index position is assigned based on the partitioning position - the first element of
the first partition will be assigned the 0th index.
PySpark SparkContext's parallelize(~) method creates a RDD (resilient distributed dataset) from the
given dataset.
Parameters
1. c | any
The data you want to convert into a RDD. Typically, you would pass a list of values.
2. numSlices | int | optional
The number of partitions to use. By default, the parallelism level set in the Spark configuration
will be used for the number of partitions.
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Creating a RDD with a list of values
rdd = sc.parallelize(["A","B","C","A"])
rdd.collect()
rdd.getNumPartitions()
rdd.collect()
Here, Spark is partitioning our list into 3 sub-datasets. We can see the content of each partition
using the glom() method:
rdd.glom().collect()
rdd.collect()
Note that parallelize will not perform partitioning based on the key, as shown here:
rdd.glom().collect()
We can see that just like the previous case, the partitioning is done using the ordering of the list.
NOTE
The returned object is a PySpark RDD (pyspark.rdd.RDD).
What makes pair RDDs special is that, we can perform additional methods such as reduceByKey(~) ,
which performs a groupby on the key and perform a custom reduction function:
new_rdd.collect()
import pandas as pd
df_pandas = pd.DataFrame({"A":[3,4],"B":[5,6]})
df_pandas
A B
0 3 5
1 4 6
rdd = df_spark.rdd
rdd.collect()
Notice how only the values of the DataFrame are kept - column labels are not included in the
RDD.
WARNING
Even though parallelize(~) can accept a Pandas DataFrame directly, this does not give us the desired
RDD:
import pandas as pd
rdd = sc.parallelize(df_pandas)
rdd.collect()
['A', 'B']
As you can see, the rdd only contains the column labels but not the data itself.
PySpark's createDataFrame(~) method creates a new DataFrame from the given list, Pandas
DataFrame or RDD.
Parameters
1. data | list-like or Pandas DataFrame or RDD
The data from which to create the PySpark DataFrame.
2. schema | DataType, string or list | optional
The column names and column types of the DataFrame.
3. samplingRatio | float | optional
If the data type is not provided via schema, then samplingRatio indicates the proportion of rows to
sample from when making inferences about the column type. By default, only the first row will be
used for type inference.
4. verifySchema | boolean | optional
Whether or not to check the data against the given schema. If the data type does not align, then an
error will be thrown. By default, verifySchema=True.
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame from a list of lists
df = spark.createDataFrame(rows)
df.show()
+----+---+
| _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame from a list of lists with the column names specified:
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
vals = [3,4,5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
| 3|
| 4|
| 5|
+-----+
Here, the IntegerType() indicates that the column is of type integer - this is needed in this case,
otherwise PySpark will throw an error.
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
import pandas as pd
df = pd.DataFrame({"A":[3,4],"B":[5,6]})
df
A B
0 3 5
1 4 6
pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
| A| B|
+---+---+
| 3| 5|
| 4| 6|
+---+---+
To create PySpark DataFrame while specifying the column names and types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField("name", StringType()),
StructField("age", IntegerType())])
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame with date columns, use the datetime library:
import datetime
df.show()
+----+----------+
|name| birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Specifying verifySchema
By default, verifySchema=True , which means that an error is thrown if there is a mismatch in the
type indicated by the schema and the type inferred from data :
schema = StructType([
StructField("name", IntegerType()),
StructField("age", IntegerType())])
df.show()
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'
Here, an error is thrown because the inferred type of column name is string , but we have specified
the column type to be integer in our schema .
By setting verifySchema=False , PySpark will fill the column with nulls instead of throwing an error:
schema = StructType([
StructField("name", IntegerType()),
StructField("age", IntegerType())])
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+
PySpark SparkSession's range(~) method creates a new PySpark DataFrame using a series of values
- this method is similar to Python's standard range(~) method.
Parameters
1. start | int
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame using range (series of values)
To create a PySpark DataFrame that holds a series of values, use the range(~) method:
df = spark.range(1,4)
df.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
Notice how the starting value is included while the ending value is not.
Note that if only one argument is supplied, then the range will start from 0 (inclusive) and the
argument will represent the end-value (exclusive):
df = spark.range(3)
df.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
Instead of the default incremental value of step=1 , we can choose a specific incremental value
using the third argument:
df = spark.range(1,6,2)
df.show()
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+
We can also use a negative step to create a decreasing series:
df = spark.range(4,1,-1)
df.show()
+---+
| id|
+---+
| 4|
| 3|
| 2|
+---+
By default, the number of partitions in which the resulting PySpark DataFrame will be split is
governed by our PySpark configuration. In my case, the default number of partitions is 8:
df = spark.range(1,4)
df.rdd.getNumPartitions()
df = spark.range(1,4, numPartitions=2)
df.rdd.getNumPartitions()