Professional Documents
Culture Documents
SparkIntro PDF
SparkIntro PDF
US Secretary of Defense
Donald Rumsfeld
was once asked a
tricky question
There are known knowns.
These are things we know
that we know.
There are known unknowns. That is to say, there are things that we
know we don't know.
But there are also unknown unknowns. There are things we don't
know we don't know.
-Donald Rumsfeld
What seemed
to be a clever
evasion
There are known knowns. These are things we know that we know.
There are known unknowns. That is to say, there are things that we
know we don't know.
But there are also unknown unknowns. There are things we don't
know we don't know.
-Donald Rumsfeld
..went on to be
recognized as a
profound truth
Known knowns
Known unknowns
Unknown unknowns
This is a framework
that can be applied to
many things in life
Known knowns
Known unknowns
In Behavioral
Unknown unknowns
psychology,
it represents a
famous model about
personal awareness
Personal Awareness
Known knowns You know what you like and dislike
Unknown unknowns in
data can only be identified
through exploration and
investigation
Unknown unknowns Data Analysis
If your data is
small enough
Excel is an
excellent tool for
data analysis
Data Analysis ( Traditionally )
Data usually
resides in a
Database The data is
accessed using
SQL
Data Analysis ( Traditionally )
SQL
Database Python / R
SQL
an interactive
Database Python / R
environment
Programming languages like
Python, R, Scala etc provide a
Read-Evaluate-Print Loop
environment
Python / R
REPL
Read-Evaluate-Print Loop
An REPL is any shell/environment which
Takes in a single expression from the user
Evaluates it
Prints a result
Python / R
REPL
Read-Evaluate-Print Loop
An REPL environment for Python
len([1,2,5])
3
Python / R
REPL
Read-Evaluate-Print Loop
Iterate
Python / R
REPL
Read-Evaluate-Print Loop
Represent the data in a form
DoanallREPL,
of this with fast
With suitable for analysis
you feedback
could Explore and identify hidden
at every patterns
Confirm and quantify using
step! Statistical models, machine learning
Iterate
Data Analysis ( Traditionally )
SQL
Database Python / R
Once an interesting
and useful insight/
model is found
You want to do
something with it
Data Analysis ( Traditionally )
SQL
Database Python / R
Doing something
useful usually means
SQL
Database Python / R
Recommendation/fraud detection etc
SQL
Database Python / R
Recommendation/fraud detection etc
SQL
Database Python / R
SQL
Database It means that you
Python / R(or
another engineer)
SQL will end up re-
implementing all your
Java / C++ code in Java/C++
Data Analysis ( Traditionally )
SQL
Database Python / R
No Interactive
environment
Limitations of Hadoops
MapReduce engine
Everything has to be expressed as a chain of map and reduce tasks
No Interactive environment
Using Resilient
Distributed Datasets
RDD
Resilient Distributed Datasets
Lot of complexities
1. Distributing data across a cluster
2. Fault tolerance (if in-memory data is lost)
Spark Core
Spark Core
Spark Core
Spark Core
It needs t wo additional
components
APACHE SPARK
Spark Core
Spark Core
Storage Cluster
System manager
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Storage System Cluster manager
APACHE SPARK
Spark Spark
SQL Spark SQL provides
StreamingMLlib GraphX
an SQL interface
Spark Core
Storage System
for Spark
Cluster manager
APACHE SPARK
Spark Spark
SQL MLlib GraphX
Lots of folks are familiar
Streaming
with
SparkSQL
Coreand like to use to
Storage express
System data manipulation
Cluster manager
APACHE SPARK
Spark Spark
SQL With Spark
Streaming MLlib GraphX
SQL, folks can use
SQLSpark
to work with large amounts
Core
Storage of
System data stored in a cluster
Cluster manager
APACHE SPARK
Spark
You
Spark
can use the familiar SQL
SQL MLlib GraphX
and still get all the in-memory
Streaming
performance
Spark Core benefits of Spark
Storage System Cluster manager
Theres another use for
Spark SQL
APACHE SPARK
Lets say you were
Spark Spark
SQL
performing
MLlib GraphX some
Streaming
complex data
Spark Core
manipulations in a
Storage System Cluster manager
program
APACHE SPARK
Load data in table
Spark form
Spark into memory
SQL Streaming MLlib GraphX
SparkManipulate
Core it using
Storage System Sparkmanager
Cluster SQL
ie. with select, group by ,
joins etc
APACHE SPARK
Load data in table form into memory
Spark Manipulate
Spark it using Spark SQL
SQL Streaming MLlib GraphX
Convert it to Python/Java objects
Spark Core
and work on them
Storage System Cluster manager
APACHE SPARK
Spark Spark
Spark allows programmers to mix
SQL Streaming MLlib GraphX
SQL manipulations with Python/
Java data
Spark Coremanipulations on the
same
Storage System dataset and within
Cluster manager a single
program
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Lets say you have a
Storage System Cluster manager
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Whenever you have a large amount
Storage System Cluster manager
of data, Machine Learning is a great
way to identify patterns within it
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
MLlib
Storage provides Cluster
System built-inmanager
Machine
Learning functionality in Spark
APACHE SPARK
Spark Spark
This
SQLbasically solves
Streaming
2 MLlib GraphX
problems with previous
Spark
computing frameworks Core
Storage System Cluster manager
Abstraction
Performance
APACHE SPARK
Spark Spark
Abstraction
SQL Streaming MLlib GraphX
Spark Core
Machine Learning
Storage System algorithms
Cluster manager
are pretty complicated
APACHE SPARK
Abstraction
Spark
Python Spark
and R have
SQL Streaming MLlib GraphX
libraries which allow
you to plug and playSpark
ML Core
algorithms
Storage Systemwith These libraries
Cluster are not
manager
minimal effort suited for distributed
computing
APACHE SPARK
Spark Spark
Abstraction
SQL Streaming MLlib GraphX
Spark Core
Hadoop MapReduce is great for distributed
Storage System Cluster manager
computing, but it requires you to implement the
algorithms yourself as map and reduce tasks
APACHE SPARK
Spark Spark
Abstraction
SQL Streaming MLlib GraphX
Spark Core
MLlib has Built-in methods for
Storage System Cluster manager
Classification, regression, clustering
etc algorithms are provided
APACHE SPARK
Spark Spark
Abstraction
SQL Streaming MLlib GraphX
Spark Core
Under the hood the library takes
Storage System Cluster manager
care of running these algorithms
across a cluster
APACHE SPARK
Spark Spark
Abstraction
SQL Streaming MLlib GraphX
Spark Core
This completely abstracts the programmer from
Storage System Cluster manager
Implementing the ML algorithm
Intricacies of running it across a cluster
APACHE SPARK
Spark Spark
This
SQLbasically solves
Streaming
2 MLlib GraphX
problems with previous
Spark
computing frameworks Core
Storage System Cluster manager
Abstraction
Performance
APACHE SPARK
Performance
Machine Learning algorithms
Spark Spark
are iterative, which means
SQL Streaming MLlib GraphX
you need to make multiple
Spark Core
passes over the same data
Hadoop MapReduce is
Storage System Cluster manager
heavy on disk writes
which is not efficient for
Machine Learning
APACHE SPARK
Performance
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Since Sparks RDDs are in-memory
Storage System Cluster manager
It can make multiple passes over the
same data without doing disk writes
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
GraphX
Storage System is a library for
Cluster manager
graph algorithms
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Many interesting
Storage System Clusterdatasets
manager
can be represented as graphs
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Social
Storage System net works,
Cluster manager
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
With GraphX you can represent
Storage System Cluster manager
and then perform computations
across these datasets
APACHE SPARK
Spark Spark
Just like with MLlib, GraphX
SQL Streaming MLlib GraphX
abstracts the Spark
programmer
Core
from having to
Storage System
implement
Cluster manager
Graph algorithms
APACHE SPARK
Spark
Becauseof Spark
RDDs and in-memory
SQL Streaming MLlib GraphX
computation, GraphX on Spark
Spark Core
gives you better performance
than other
Storage computing
System Cluster manager
frameworks (like MapReduce)
APACHE SPARK
Spark Spark
SQL Streaming MLlib GraphX
Spark Core
Storage System Cluster manager
A result is requested
using an action
Transformation Action
To summarize, data is
processed only when the
user requests a result
Transformation Action
This is known as
Lazy Evaluation
Lazy Evaluation
This is what allows
Spark to perform
fast computations on
large amounts of data
Lazy Evaluation
Spark will keep a
record of the series of
transformations
requested by the user
Lazy Evaluation
When called upon to execute,
Spark takes these operations
and groups them in
an efficient way
More on this later
Lets now formally
understand RDDs