You are on page 1of 11

Lesson 7

Extract, Transform and Load


Process

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 1
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
ETL process three functions
• Extract which does the acquisition of
data from Data Store querying or from
another program,
• Transform which does the change of data
into a desired file, columnar, tabular or other.
• Load which does the process of
placing transformed data into another
Data Store or data warehouse
“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 2
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Transform Functions
• join(), groupBy(), cogroup(), filter(),
map(), mapValues(), flatMap(), sort(),
pratitionBy(), groupByKey(),
reduceByKey(), aggregateByKey(),
pipe(), coalesce(), sample(), union(),
crossProduct()

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 3
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Spark 2.3 with Pandas
• Includes transformation functions on
complex objects like arrays, maps and
set of columns
• Pandas provide powerful
transformation UDFs, VUDFs and
GVUDFs

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 4
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Figure 5.9: An ETL pipeline using Spark SQL for
ETL Process and Data Source API v2 in Spark 2.3.

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 5
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Extract
• Skipping Corrupt or Bad Records
or Files

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 6
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Extract and Load
• Multi-line JSON/CSV Support
• Load and Save files: SerDe uses codes
for obtaining records from
unstructured data
• Save process uses serializer codes
• Loading (extracting) process uses
deserializer.

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 7
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Example for Load and Save
• Example 5.13 explains the codes for
sequence File, JSON and CSV file
load and save functions for obtaining
records/rows/files

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 8
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Example
• Example 5.14 explains Spark SQL
transformations in Spark 2.3
• Complex objects, nested tables (one
column rows) and array
transformations
• Using the DataframeWriter API.

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 9
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
Summary
• Extract, Transform and Load
• Transform functions
• Load and Save
• Spark 2.3 includes transformation
functions on complex objects like
arrays, maps and set of columns
• Pandas provide powerful transformation
UDFs, VUDFs and GVUDFs
“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 10
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)
End of Lesson 7 on
Applications and Big Data
analytics using Spark

“Big Data Analytics “, Ch.05 L07: Spark and Big Data Analytics
2019 11
Raj Kamal, and Preeti Saxena © McGraw-Hill Education (India)

You might also like