

ScaleByte
APACHE SPARK AND SCALA

www.scalebyte.com
Introduction

 Spark is a framework that enables distributed processing and also provides in-memory computation power.
 It is an open-source project used for fast data analytics.
 It is one of Apache's top-level projects.
 It provides high-level APIs in Java, Python and Scala, with a rich built-in library.
Introduction Contd..

 Spark runs on a cluster and can access data sources on HDFS as well as Cassandra.
 There is a need for fast processing: waiting a long time, online or offline, for results is unacceptable nowadays, hence Spark was introduced.
 It provides high-level tools such as Spark SQL, and MLlib for machine learning.
Batch Vs Real-time (Stream) Scenario
Analytics Type Based on Input Data
Spark real time streaming

 Chop up the live stream into batches of X seconds
 Spark treats each batch of data as RDDs and processes them using RDD operations
 Finally, the processed results of the RDD operations are returned in batches, as sketched below
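
A minimal sketch of this batch-of-X-seconds model using the classic DStream API; the socket source, port and 5-second interval below are illustrative assumptions, not from the original slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))       // chop the stream into 5-second batches

val lines  = ssc.socketTextStream("localhost", 9999)    // each batch arrives as an RDD
val errors = lines.filter(_.contains("error")).count()  // ordinary RDD-style operations per batch
errors.print()                                          // results are emitted batch by batch

ssc.start()
ssc.awaitTermination()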
Why Spark?

 It is a powerful and scalable open-source processing engine
 Features
 Provides in-memory computation
 Uses RDDs
 Works on HDFS, S3, Cassandra, etc.
 Spark's job scheduling is faster than MapReduce's
 It set a large-scale sorting world record (Daytona GraySort, 2014)
In-Memory Computations
Spark unified stack

What is an RDD?

Resilient Distributed Dataset [primary core abstraction]
 A collection of large-scale data with the following properties:
 Immutable (read-only)
 Fault tolerant, thanks to the RDD lineage DAG
 Distributed and partitioned across the cluster
 Lazily evaluated
 Type inferred
 Cacheable
RDD Lineage — Logical Execution Plan

 The lineage is built as a result of applying transformations to the RDD, and it forms a Logical Execution Plan, which can be inspected as sketched below.
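
A small sketch; the RDD names and operations below are illustrative assumptions, showing how the lineage built by transformations can be printed with toDebugString:

val nums    = sc.parallelize(1 to 100)
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
println(evens.toDebugString)   // prints the lineage (logical execution plan) of evens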

RDD Basics

 In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations (actions) on RDDs to compute a result
 There are two ways of creating RDDs:
 By loading an external dataset
 val lines = sc.textFile("Readme.md")
 By distributing a collection of elements/objects
 val data = 1 to 100
 val dataRDD = sc.parallelize(data)
RDD basics

 Once created, there are two kinds of operations that can be performed on RDDs (see the sketch after this list):
 Transformations
 Apply functions to create a new RDD from an existing one
 Build up the DAG
 Lazily evaluated
 Do not return a value to the driver
 Actions
 Compute results by applying functions to an RDD
 Return a value to the driver (or write it to storage)
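
A minimal sketch contrasting the two; the RDD and numbers are illustrative assumptions:

val nums    = sc.parallelize(1 to 10)    // create an RDD
val squares = nums.map(n => n * n)       // transformation: lazily records a new RDD in the DAG
val total   = squares.reduce(_ + _)      // action: triggers the computation and returns 385 to the driver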

RDD basics

 In some cases you want to use the contents of an RDD repeatedly; in such cases you can persist the data using the persist function (called before the actions, so that the results are reused)
 lines = sc.textFile("Readme.txt")
 pythonLines = lines.filter(lambda line: "Python" in line)
 pythonLines.persist()
 pythonLines.count()
 pythonLines.first()
RDD Operations (Transformations)

 Transformations
 Transformations are operations that return new RDDs, e.g. map(), filter(), etc.
 filter() transformation in Scala
 val inputRDD = sc.textFile("log.txt")
 val errorsRDD = inputRDD.filter(line => line.contains("error"))
 filter() transformation in Python
 inputRDD = sc.textFile("log.txt")
 errorsRDD = inputRDD.filter(lambda x: "error" in x)
RDD Operations (Transformations)

 Let's use inputRDD again to search for lines with the word "warning"
 We'll use another transformation, union(), to print out the number of lines that contained either "error" or "warning"
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
RDD Operations (Transformations)

 union() is a bit different from filter(), in that it operates on two RDDs instead of one.

RDD Operations (Actions)

 Actions
 Actions are operations performed on an RDD that return results to the driver program, or store them in external storage, e.g. count(), first(), etc.
 Suppose we want to print out some information about badLinesRDD; see the examples on the next slide
RDD Operations (Actions)

Scala error count using actions

println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)

Python error count using actions

print("Input had " + str(badLinesRDD.count()) + " concerning lines")
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)
RDD Operations (Actions)
Note: it is important to know that each time we call a new action, the RDD is recomputed from scratch, which is time-consuming.
That is why the persist() / cache() methods are available: they let us keep intermediate results / RDDs in memory for further use, as sketched below.
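
A minimal sketch of reusing a cached RDD; the file name and filter condition are illustrative assumptions:

val logs   = sc.textFile("log.txt")
val errors = logs.filter(_.contains("error"))
errors.cache()     // mark the RDD to be kept in memory
errors.count()     // first action computes the RDD and caches it
errors.take(5)     // reuses the cached partitions instead of recomputing from scratch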
Lazy Evaluation

Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action.
For the statements below, Spark does not start any computation, because none of the three statements is an action:
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

Passing functions to Spark

word = rdd.filter(lambda s: "error" in s)

def containsError(s):
    return "error" in s

word = rdd.filter(containsError)
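
The same pattern in Scala, as a sketch (assuming rdd is an RDD[String]):

val word = rdd.filter(s => s.contains("error"))

def containsError(s: String): Boolean = s.contains("error")
val word2 = rdd.filter(containsError)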

RDD Transformations
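
A small illustrative sketch of a few common transformations; the input values are assumptions:

val nums = sc.parallelize(List(1, 2, 3, 3))
nums.map(_ + 1)                       // 2, 3, 4, 4
nums.flatMap(n => List(n, n * 10))    // 1, 10, 2, 20, 3, 30, 3, 30
nums.filter(_ > 2)                    // 3, 3
nums.distinct()                       // 1, 2, 3
nums.union(sc.parallelize(List(4)))   // 1, 2, 3, 3, 4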
RDD Actions
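
A small illustrative sketch of a few common actions; the input values are assumptions:

val nums = sc.parallelize(List(1, 2, 3, 3))
nums.collect()        // Array(1, 2, 3, 3)
nums.count()          // 4
nums.take(2)          // Array(1, 2)
nums.reduce(_ + _)    // 9
nums.countByValue()   // Map(1 -> 1, 2 -> 1, 3 -> 2)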
Quick architectural overview

 Your program acts as the driver
 So does your Spark shell
 The driver program is just one part of the Spark application

SPARK Architecture

 A Spark application starts on a two-node cluster
 The driver program contacts the master for resources
 Next, the master contacts the worker nodes
 The worker nodes create executors
 The executors then connect directly to the driver, and all further communication happens between the driver and the executor nodes (see the sketch below)
[diagram: driver, master, two workers, and executors running tasks]

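A minimal sketch of how a driver program names the master it will request resources from; the standalone master URL and app name are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

// the driver program asks the cluster master for executors on the workers
val conf = new SparkConf()
  .setAppName("ArchitectureSketch")
  .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
val sc = new SparkContext(conf)

// work submitted through sc runs on the executors and results come back to the driver
println(sc.parallelize(1 to 100).sum())

sc.stop()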
Major Industries leveraging Analytics
Before We Go Ahead
Our area of interest is Real-Time Analytics

Let's explore tools that can give low latency and high throughput for real-time analytics

Most Popular Real-Time Analytics Tool
Ideal Tool for Real-Time Analytics
Apache Flink: Ideal Tool for Real-Time Analytics
Apache Flink

 Apache Flink is an open-source platform, a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams
 Flink is a top-level project of Apache
 Flink is a scalable data analytics framework that is fully compatible with Hadoop
 Flink can execute both stream processing and batch processing easily
Apache Flink

 It is still at an early stage, i.e. it has not yet been explored by much of the analytics community
 Many companies are migrating to it
 Version available for download: Apache Flink 1.3.0

Features of Apache Flink

i. Low Latency and High Performance
Apache Flink provides high performance and low latency without any heavy configuration. Its pipelined architecture provides a high throughput rate. It processes data at lightning-fast speed and is also called the 4G of Big Data.
ii. Fault Tolerance
The fault-tolerance mechanism in Apache Flink is based on Chandy-Lamport distributed snapshots and provides strong consistency guarantees.
iii. Memory Management
Memory management in Apache Flink provides control over how much memory is used by certain runtime operations.
iv. Iterations
Apache Flink provides dedicated support for iterative algorithms (machine learning, graph processing).
v. Integration
Apache Flink integrates easily with the wider open-source data processing ecosystem: it can be integrated with Hadoop, stream data from Kafka, and run on YARN, as sketched below.
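
A minimal Flink streaming word-count sketch using the Scala API; the socket source, host and port are illustrative assumptions:

import org.apache.flink.streaming.api.scala._

object FlinkWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env  = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)   // hypothetical streaming source

    val counts = text
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)        // key by the word
      .sum(1)          // running count per word

    counts.print()
    env.execute("Streaming WordCount sketch")
  }
}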

The Strength of Flink comes from its Architecture

Lambda Architecture
Other Users of Apache Flink
Conclusion: Now is the Time for Apache Flink

Big Data Processing Tool
