

ScaleByte
APACHE SPARK AND SCALA

www.scalebyte.com
Introduction

 Spark is a framework that enables distributed processing and also provides in-memory computation power.
 It is an open-source project used for fast data analytics.
 It is one of Apache's top-level projects.
 It provides high-level APIs in Java, Python and Scala, with a rich built-in library.
Introduction Contd..

 Spark runs on a cluster and can access data sources on HDFS as well as Cassandra.
 There is a need for fast processing: waiting a long time, online or offline, for results is unacceptable nowadays, hence Spark was introduced.
 It provides high-level tools such as Spark SQL, and MLlib for machine learning.
Batch Vs Real-time (Stream) Scenario
Analytics Type Based on Input Data
Spark real time streaming

 Chop up the live stream into batches of X seconds
 Spark treats each batch of data as RDDs and processes them using RDD operations
 Finally, the processed results of the RDD operations are returned in batches, as sketched below
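
A minimal sketch of this batch-of-X-seconds model using the classic DStream API; the socket source, port and 5-second interval below are illustrative assumptions, not from the original slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))       // chop the stream into 5-second batches

val lines  = ssc.socketTextStream("localhost", 9999)    // each batch arrives as an RDD
val errors = lines.filter(_.contains("error")).count()  // ordinary RDD-style operations per batch
errors.print()                                          // results are emitted batch by batch

ssc.start()
ssc.awaitTermination()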
Why Spark?

 It is a powerful and scalable open-source processing engine
 Features
 Provides in-memory computation
 Uses RDDs
 Works on HDFS, S3, Cassandra, etc.
 Spark's job scheduling is faster than MapReduce's
 It set a large-scale sorting world record (Daytona GraySort, 2014)
In-Memory Computations
Spark unified stack

What is an RDD?

Resilient Distributed Dataset [primary core abstraction]
 A collection of large-scale data with the following properties:
 Immutable (read-only)
 Fault tolerant, thanks to the RDD lineage DAG
 Distributed and partitioned across the cluster
 Lazily evaluated
 Type inferred
 Cacheable
RDD Lineage — Logical Execution Plan

 The lineage is built as a result of applying transformations to the RDD, and it forms a Logical Execution Plan, which can be inspected as sketched below.
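
A small sketch; the RDD names and operations below are illustrative assumptions, showing how the lineage built by transformations can be printed with toDebugString:

val nums    = sc.parallelize(1 to 100)
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
println(evens.toDebugString)   // prints the lineage (logical execution plan) of evens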

RDD Basics

 In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations (actions) on RDDs to compute a result
 There are two ways of creating RDDs:
 By loading an external dataset
 val lines = sc.textFile("Readme.md")
 By distributing a collection of elements/objects
 val data = 1 to 100
 val dataRDD = sc.parallelize(data)
RDD basics

 Once created, there are two kinds of operations that can be performed on RDDs (see the sketch after this list):
 Transformations
 Apply functions to create a new RDD from an existing one
 Build up the DAG
 Lazily evaluated
 Do not return a value to the driver
 Actions
 Compute results by applying functions to an RDD
 Return a value to the driver (or write it to storage)
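
A minimal sketch contrasting the two; the RDD and numbers are illustrative assumptions:

val nums    = sc.parallelize(1 to 10)    // create an RDD
val squares = nums.map(n => n * n)       // transformation: lazily records a new RDD in the DAG
val total   = squares.reduce(_ + _)      // action: triggers the computation and returns 385 to the driver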

RDD basics

 In some cases you want to use the contents of an RDD repeatedly; in such cases you can persist the data using the persist function (called before the actions, so that the results are reused)
 lines = sc.textFile("Readme.txt")
 pythonLines = lines.filter(lambda line: "Python" in line)
 pythonLines.persist()
 pythonLines.count()
 pythonLines.first()
RDD Operations (Transformations)

 Transformations
 Transformations are operations that return new RDDs, e.g. map(), filter(), etc.
 filter() transformation in Scala
 val inputRDD = sc.textFile("log.txt")
 val errorsRDD = inputRDD.filter(line => line.contains("error"))
 filter() transformation in Python
 inputRDD = sc.textFile("log.txt")
 errorsRDD = inputRDD.filter(lambda x: "error" in x)
RDD Operations (Transformations)

 Let's use inputRDD again to search for lines with the word "warning"
 We'll use another transformation, union(), to print out the number of lines that contained either "error" or "warning"
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
RDD Operations (Transformations)

 union() is a bit different from filter(), in that it operates on two RDDs instead of one.

RDD Operations (Actions)

 Actions
 Actions are operations performed on an RDD that return results to the driver program, or store them in external storage, e.g. count(), first(), etc.
 Suppose we want to print out some information about badLinesRDD; see the examples on the next slide
RDD Operations (Actions)

Scala error count using actions

println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)

Python error count using actions

print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)
RDD Operations (Actions)
Note: it is important to know that each time we call a new action, the RDD is recomputed from scratch, which is time-consuming.
That is why the persist() / cache() methods are available: they let us keep intermediate results / RDDs in memory for further use, as sketched below.
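
A minimal sketch of reusing a cached RDD; the file name and filter condition are illustrative assumptions:

val logs   = sc.textFile("log.txt")
val errors = logs.filter(_.contains("error"))
errors.cache()     // mark the RDD to be kept in memory
errors.count()     // first action computes the RDD and caches it
errors.take(5)     // reuses the cached partitions instead of recomputing from scratch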
Lazy Evaluation

Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
For the statements below, Spark does not start any computation, because none of the three statements is an action:
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

Passing functions to Spark

word = rdd.filter(lambda s: "error" in s)

def containsError(s):
    return "error" in s

word = rdd.filter(containsError)
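
The same pattern in Scala, as a sketch (assuming rdd is an RDD[String]):

val word = rdd.filter(s => s.contains("error"))

def containsError(s: String): Boolean = s.contains("error")
val word2 = rdd.filter(containsError)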

RDD Transformations
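
A small illustrative sketch of a few common transformations; the input values are assumptions:

val nums = sc.parallelize(List(1, 2, 3, 3))
nums.map(_ + 1)                       // 2, 3, 4, 4
nums.flatMap(n => List(n, n * 10))    // 1, 10, 2, 20, 3, 30, 3, 30
nums.filter(_ > 2)                    // 3, 3
nums.distinct()                       // 1, 2, 3
nums.union(sc.parallelize(List(4)))   // 1, 2, 3, 3, 4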
RDD Actions
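
A small illustrative sketch of a few common actions; the input values are assumptions:

val nums = sc.parallelize(List(1, 2, 3, 3))
nums.collect()        // Array(1, 2, 3, 3)
nums.count()          // 4
nums.take(2)          // Array(1, 2)
nums.reduce(_ + _)    // 9
nums.countByValue()   // Map(1 -> 1, 2 -> 1, 3 -> 2)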
Quick architectural overview

 Your program acts as the driver
 So does your Spark shell
 The driver program is just one part of the Spark application

SPARK Architecture

 A Spark application starts on a two-node cluster
 The driver program contacts the master for resources
 Next, the master contacts the worker nodes
 The worker nodes create executors
 The executors then connect directly to the driver, and all further communication happens between the driver and the executor nodes (see the sketch below)
[diagram: driver, master, two workers, and executors running tasks]

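A minimal sketch of how a driver program names the master it will request resources from; the standalone master URL and app name are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

// the driver program asks the cluster master for executors on the workers
val conf = new SparkConf()
  .setAppName("ArchitectureSketch")
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
val sc = new SparkContext(conf)

// work submitted through sc runs on the executors and results come back to the driver
println(sc.parallelize(1 to 100).sum())

sc.stop()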
Major Industries leveraging Analytics
Before We Go Ahead
Our area of interest is Real-Time Analytics

Let's explore tools that can give low latency and high throughput for real-time analytics

Most Popular Real-Time Analytics Tool
Ideal Tool for Real-Time Analytics
Apache Flink: Ideal Tool for Real-Time Analytics
Apache Flink

 Apache Flink is an open-source platform, a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams
 Flink is a top-level project of Apache
 Flink is a scalable data analytics framework that is fully compatible with Hadoop
 Flink can execute both stream processing and batch processing easily
Apache Flink

 It is still at an early stage, i.e. it has not yet been explored by much of the analytics community
 Many companies are migrating to it
 Version available for download: Apache Flink 1.3.0

Features of Apache Flink

i. Low Latency and High Performance
Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate. It processes data at lightning-fast speed and is also called the 4G of Big Data.
ii. Fault Tolerance
The fault-tolerance mechanism in Apache Flink is based on Chandy-Lamport distributed snapshots and provides strong consistency guarantees.
iii. Memory Management
Memory management in Apache Flink provides control over how much memory is used by certain runtime operations.
iv. Iterations
Apache Flink provides dedicated support for iterative algorithms (machine learning, graph processing).
v. Integration
Apache Flink integrates easily with the wider open-source data processing ecosystem: it can be integrated with Hadoop, stream data from Kafka, and run on YARN, as sketched below.
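
A minimal Flink streaming word-count sketch using the Scala API; the socket source, host and port are illustrative assumptions:

import org.apache.flink.streaming.api.scala._

object FlinkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env  = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)   // hypothetical streaming source

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)        // key by the word
      .sum(1)          // running count per word

    counts.print()
    env.execute("Streaming WordCount sketch")
  }
}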

The Strength of Flink comes from its Architecture

Lambda Architecture
Other Users of Apache Flink
Conclusion: Now is the Time for Apache Flink

Big Data Processing Tool
