Slides
Kashif Murtaza
Muhammad Ahmad
AI Sciences Instructor
@AISciencesLearn
BIG DATA
Prerequisites
Website: www.aisciences.io
Applications of Spark
▪ Streaming Data
▪ Machine Learning
▪ Batch Data
▪ ETL Pipelines
▪ Full Load and Ongoing Replication
Your Instructor
MUHAMMAD AHMAD
(Cloud and Big Data Engineer)
What’s Inside?
Methodology
Projects
Student Data Analysis
Employee Data Analysis
Collaborative Filtering
Spark Streaming
ETL Pipeline
Full Load and Ongoing Replication
Spark
Why Spark?
▪ Speed
▪ Distributed
▪ Advanced Analytics
▪ Real Time
▪ Powerful Caching
▪ Fault Tolerant
▪ Deployment
HADOOP
YARN
HDFS
Spark Architecture
Driver Node
Workers
Spark Ecosystem
▪ Spark SQL
▪ Spark Streaming
▪ Spark MLlib
▪ Spark GraphX
User 1 1 5 4 N/A
User 4 4 4 2 5
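A minimal sketch of how a ratings matrix like the one above can be flattened into (user, item, rating) rows, the long layout that collaborative filtering models such as ALS typically consume. Reading the two rows as "User 1" and "User 4", each followed by four ratings, is an assumption, and the item names (`item_1` to `item_4`) are hypothetical because the column headers did not survive in the slide. Missing ratings (N/A) are dropped rather than stored as zeros.

```python
# Hypothetical item labels: the original column headers are not available.
ITEMS = ["item_1", "item_2", "item_3", "item_4"]

# Ratings as read from the slide; None stands in for N/A (no rating given).
RATINGS_MATRIX = {
    "User 1": [1, 5, 4, None],
    "User 4": [4, 4, 2, 5],
}


def to_triples(matrix, items):
    """Flatten a user -> ratings-row mapping into (user, item, rating) tuples.

    Unrated cells (None) are skipped, so only observed ratings remain.
    """
    triples = []
    for user, ratings in matrix.items():
        for item, rating in zip(items, ratings):
            if rating is not None:
                triples.append((user, item, float(rating)))
    return triples


if __name__ == "__main__":
    for triple in to_triples(RATINGS_MATRIX, ITEMS):
        print(triple)
```

Skipping the N/A cell matters: the model should predict that rating, not learn from a placeholder value.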
Explicit and Implicit Ratings
Expected Results
1 1 4.8
1 22 5
2 12 4
2 11 3.9
Hands On
Dataset Overview
Joining DFs
Create Train and Test Data
ALS model
Hyperparameter tuning and cross-validation
Best model and evaluate predictions
Recommendations
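Two of the steps above, creating train and test data and evaluating predictions, can be illustrated without a cluster. The sketch below is plain Python, not the course notebook: it shows a seeded random split and a root-mean-square error (RMSE) score, a common metric for rating predictions. In PySpark the corresponding pieces are `DataFrame.randomSplit` and `RegressionEvaluator(metricName="rmse")`. The function names and the 80/20 split are assumptions made for illustration.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows with a fixed seed and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """RMSE over the (user, item) keys present in both dictionaries."""
    keys = actual.keys() & predicted.keys()
    if not keys:
        raise ValueError("no overlapping (user, item) pairs to evaluate")
    squared_error = sum((actual[k] - predicted[k]) ** 2 for k in keys)
    return math.sqrt(squared_error / len(keys))


if __name__ == "__main__":
    # Hypothetical ratings, only to exercise the two functions.
    rows = [(user, item, 3.0) for user in range(5) for item in range(2)]
    train, test = train_test_split(rows)
    print(len(train), len(test))
    print(rmse({(1, 1): 4.0, (1, 2): 2.0}, {(1, 1): 3.0, (1, 2): 3.0}))
```

Fixing the seed keeps the split reproducible, so different hyperparameter settings are compared on the same held-out rows.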
Spark Streaming
Spark Streaming With RDD
Spark Streaming With DF
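Spark Streaming processes a live stream as a sequence of small batches. The sketch below imitates that idea in plain Python, as an illustration only and not Spark's API: incoming lines are grouped into micro-batches, each batch is counted on its own, and the result is merged into a running state. The function names and the batch size are hypothetical.

```python
from collections import Counter


def micro_batches(lines, batch_size):
    """Yield consecutive slices of the input, one slice per micro-batch."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def running_word_counts(lines, batch_size=2):
    """Return the cumulative word counts after each micro-batch."""
    state = Counter()
    snapshots = []
    for batch in micro_batches(lines, batch_size):
        # Count words within this batch only, then fold into the running state.
        batch_counts = Counter(word for line in batch for word in line.split())
        state.update(batch_counts)
        snapshots.append(dict(state))
    return snapshots


if __name__ == "__main__":
    for snapshot in running_word_counts(["a b", "a", "b b"]):
        print(snapshot)
```

Keeping the per-batch count separate from the running state mirrors the split between stateless and stateful stream operations.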
ETL Pipeline
Data Extraction (csv, txt, jdbc, …) -> ETL -> Data Load (csv, txt, jdbc, …)
PySpark on Postgres
CSV in DBFS -> Extract -> Databricks Notebook -> Load -> Database in AWS RDS
Transformation
Data Set
Extract
Transform
Installing PostgreSQL
Load
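The Extract, Transform, Load steps can be sketched end to end with the standard library alone. This is a stand-in, not the course pipeline: `csv` replaces the Spark CSV reader and an in-memory SQLite database replaces PostgreSQL, so the flow runs anywhere. In PySpark the extract and load steps would typically use `spark.read.csv` and `DataFrame.write.jdbc`. The table name `employees` and the columns `id`, `name`, `salary` are hypothetical.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Extract: parse CSV text (with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows without a name, tidy names, cast column types."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue
        cleaned.append((int(row["id"]), name.title(), float(row["salary"])))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a (hypothetical) employees table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employees "
        "(id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO employees VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    sample = "id,name,salary\n1, alice ,1000\n2,,2000\n3,bob,1500.5\n"
    connection = sqlite3.connect(":memory:")
    load(transform(extract(sample)), connection)
    print(connection.execute("SELECT * FROM employees ORDER BY id").fetchall())
```

Keeping the three stages as separate functions lets each one be tested or swapped independently, for example when the source changes from a file to a JDBC connection.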
Project
CDC - Change Data Capture / Ongoing Replication
Project Architecture
Source RDS -> MySQL
Endpoint
DMS
Invoke
READ
FINAL READ
Glue -> PySpark
HDFS / S3 WRITE
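The core of full load plus ongoing replication is replaying change records on top of the initial snapshot. A minimal plain-Python sketch follows, assuming each change record carries an operation flag in a hypothetical `Op` field (`I` for insert, `U` for update, `D` for delete) and that rows are identified by an `id` key. The record layout produced by the actual pipeline may differ.

```python
def apply_changes(snapshot, changes, key="id"):
    """Replay insert/update/delete records, in order, over a full-load snapshot.

    snapshot: list of row dicts from the initial full load.
    changes:  list of row dicts, each with an extra "Op" field (I, U or D).
    Returns the resulting rows sorted by key.
    """
    table = {row[key]: dict(row) for row in snapshot}
    for change in changes:
        op = change["Op"]
        row = {k: v for k, v in change.items() if k != "Op"}
        if op in ("I", "U"):
            # Upsert: merge the new values over any existing row.
            table[row[key]] = {**table.get(row[key], {}), **row}
        elif op == "D":
            table.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return [table[k] for k in sorted(table)]


if __name__ == "__main__":
    full_load = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    cdc = [
        {"Op": "I", "id": 3, "name": "c"},
        {"Op": "U", "id": 1, "name": "z"},
        {"Op": "D", "id": 2},
    ]
    print(apply_changes(full_load, cdc))
```

Changes are applied in arrival order, so a later update or delete for the same key overrides an earlier one.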