Grupo de Bioingeniería y Telemedicina
Big Data
SPARK and Sparklyr
Contents
Review
Apache Spark
Hadoop vs Spark
Apache Spark Components
Programming model in Spark: RDD
Review: Hadoop
[Figure: word-count example. The map phase reads the input text (an excerpt on the effect of antihypertensive treatment on mortality, cardiovascular disease and coronary heart disease) and emits (word, 1) key-value pairs, e.g. (The, 1), (cardiovascular, 1), (SBP, 1); the pairs are then grouped by key and the reduce phase sums the values, producing counts such as (SBP, 2), (the, 2), (treatment, 3).]
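To make the map and reduce phases concrete, here is a minimal, single-machine sketch in plain R (illustration only; Hadoop distributes these steps across the cluster, and the sample sentences are invented):
text <- c("the effect of antihypertensive treatment",
          "treatment reduced cardiovascular mortality")
words <- unlist(strsplit(text, " "))
pairs <- data.frame(key = words, value = 1)      # "map": emit a (word, 1) pair per word
counts <- tapply(pairs$value, pairs$key, sum)    # "shuffle + reduce": sum the values per key
counts["treatment"]                              # 2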
Hadoop
Well suited to sequential, single-pass data processing
Each step consists of a Map and a Reduce function with a single pass over the input
But, for applications involving several passes over the input:
High cost of setting up the MapReduce operations at each step
The MapReduce workflow requires constantly writing intermediate results to, and reading them back from, the cluster's storage
This slows the system down because of the time added by these I/O operations
Difficult integration of different third-party tools
Hadoop vs Spark
Apache Spark
Runs on a cluster setup similar to Hadoop/HDFS
Achieves improved performance
Provides added functionality:
It shares data across data structures in memory
It allows the same dataset to be processed in parallel
It includes utilities for data analysis and processing
Hadoop vs Spark
[Figure: data sharing between iterative operations in Hadoop MapReduce (through disk/HDFS) vs. Spark (in memory)]
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Apache Spark Components
http://spark.apache.org
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Components
Driver program:
Contains the SparkContext object, which coordinates the processes of a Spark application on a cluster
The SparkContext can connect to several types of cluster managers, which allocate resources across applications
Once connected, Spark acquires executors on nodes in the cluster: processes that run computations and store data
Next, it sends the application code to the executors
Finally, the SparkContext sends tasks to the executors to run
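From R, this driver-side setup is what sparklyr manages for you (a minimal sketch; sparklyr itself is introduced later in these slides):
library(sparklyr)
sc <- spark_connect(master = "local")  # starts the driver and acquires executors
spark_context(sc)                      # handle to the underlying SparkContext object
spark_disconnect(sc)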
Spark Workers
Worker
Any node that can run application code in the cluster or in local
threads
Executor
A process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk
Each application has its own executors
Programming model in Spark: RDD
Resilient Distributed Datasets (RDD)
Fundamental data structure of Spark
Immutable distributed collection of objects that can be stored in memory or on disk
An RDD is a read-only, partitioned collection of records
It can be computed on different nodes of the cluster and operated on in parallel
As transformations are applied to RDDs, a DAG (directed acyclic graph) of operations is built up:
Transformations update the graph, but nothing is computed until an action is called
Invoking an action:
Get the total length of all the lines -> applies the transformations + the action
Returns the value to the caller
Given an RDD of strings called lines (e.g. read with sc.textFile):
val lineLengths = lines.map(s => s.length)
val totalLengths = lineLengths.reduce((a, b) => a + b)
RDD basic operations
Reduce by key, adding up all the values of the same key
.reduceByKey((a, b) => a + b)
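The same key-wise aggregation can later be expressed from R through sparklyr's dplyr interface (a sketch, assuming a Spark DataFrame word_pairs with columns word and n; sparklyr is covered below):
library(dplyr)
word_counts <- word_pairs %>%
  group_by(word) %>%
  summarise(n = sum(n, na.rm = TRUE))   # adds up all the values of the same key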
Example: Log Mining
Load error messages from a log into memory, then interactively search for patterns.
The driver program builds up the following RDDs:
val lines = sc.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(2))    // transformation: split lines by tab
messages.cache()
messages.filter(_.contains("php")).count()     // first action: a cache is created in memory on the workers
messages.filter(_.contains("mysql")).count()   // second action: uses the data already in the cache
The first action computes the transformations and caches the messages in the executors' memory (the driver keeps a pointer to the cached data); the second action is processed using the cache, with no recomputation of the input.
Caching the data provides faster results: with 1 TB of log data, roughly 5-7 s from cache vs. 170 s on-disk.
Transformations: map, filter, groupByKey, reduceByKey, sample, union, join, cache, ...
Actions: reduce, collect, count, save, lookupKey, ...
RDD operations - transformations
Transformations are lazy evaluations: Results are not computed right away
Transformation: Meaning (https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: when grouping in order to perform an aggregation over each key, using reduceByKey or aggregateByKey will yield much better performance.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported with leftOuterJoin and rightOuterJoin.
RDD operations - actions
Actions return values
Action: Meaning (http://spark.apache.org/docs/latest/programming-guide.html#actions)
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset.
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
SparkContext
https://spark.rstudio.com
R vs Sparklyr
https://therinspark.com/analysis.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Context in sparklyr
The entry point into all relational functionality in Spark is the SQLContext:
SparkSQL is a Spark module for structured data processing
Sparklyr works solely with DataFrames, which require SparkSQL
It translates dplyr functions and vector expressions into SQL
Example:
Load the package:
library(sparklyr)
Connect to Spark:
spark_conn <- spark_connect(master = "local")
Disconnect from Spark:
spark_disconnect(spark_conn)
*To see how the requested configuration affected the Spark connection, go to the
Executors page in the Spark Web UI available in http://localhost:4040/
Data manipulation in Sparklyr: DataFrames
Filter data:
filter(flights, month == 1, day == 1)
Arrange rows (re-order rows):
arrange(flights, year, month, day)
Select columns by name
select(flights, year, month, day)
Extract distinct (unique) rows:
distinct(flights, tailnum)
Add new columns
mutate(flights, speed = distance / air_time * 60)
Summarize values to a single row:
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
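As noted above, these verbs are translated into Spark SQL rather than executed in R. One way to see the translation (a sketch, assuming flights is a Spark DataFrame, e.g. created with copy_to()):
library(dplyr)
flights %>%
  filter(month == 1, day == 1) %>%
  show_query()   # prints the SQL statement that Spark will run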
Data Manipulation based on package dplyr (III)
group_by()
It breaks a dataset down into specified groups of rows
When a function is applied to the resulting object, it is automatically applied "by group"
arrange() orders first by the grouping variables
sample_n() and sample_frac() sample the specified number/fraction of rows in
each group
sample_n(flights, 10)
slice() extracts rows within each group
summarize() is used with aggregate functions (min, max, sum, mean, sd, median, ...)
These take a vector of values and return a single number
summarize(per_day, flights = sum(flights))
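per_day is not defined on this slide; one plausible way to build it (a sketch, assuming flights has year, month and day columns) is:
per_day <- flights %>%
  group_by(year, month, day) %>%
  summarize(flights = n())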
Use of piping with dplyr
c4 <- flights %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
select(carrier, dep_delay, air_time, distance) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)
Copy data into Spark
copy_to() copies an R data frame into Spark; src_tbls() lists the tables available in the connection
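A short sketch, assuming the spark_conn connection from before and the flights data frame from the nycflights13 package:
library(nycflights13)
flights_tbl <- copy_to(spark_conn, flights, "flights", overwrite = TRUE)
src_tbls(spark_conn)   # lists the tables registered in the connection, e.g. "flights"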
Machine Learning in Sparklyr
MLlib is Spark's machine learning library
sparklyr exposes it through Spark's DataFrame-based spark.ml API
Machine learning workflows can be created and tuned together with sparklyr's dplyr interface
Sparklyr provides three families of functions:
Machine learning algorithms for analyzing data (ml_*)
Feature transformers for manipulating individual features (ft_*)
Functions for manipulating Spark DataFrames (sdf_*)
https://spark.rstudio.com/mlib/
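A minimal sketch of the ml_* family, using the iris_tbl Spark DataFrame created in Example B below:
fit <- ml_linear_regression(iris_tbl, Petal_Length ~ Petal_Width)
summary(fit)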
Sparklyr – ML functions
Sparklyr - Transformers
https://spark.rstudio.com/packages/sparklyr/latest/reference/#spark-feature-transformers
Sparklyr - Utilities
https://spark.rstudio.com/packages/sparklyr/latest/reference/#spark-ml---evaluation
Example A – Data Manipulation based on dplyr
sdf_register(sql2, "jfk2sfo")
Collecting to R
You can copy data from Spark into R’s memory by using collect()
collect() executes the Spark query and returns the results to R for further analysis and
visualization
For example, to plot the data from the Hive data store:
ggplot(jfk2sfo, aes(month, air_time_hours, group=month)) + geom_boxplot()
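For that ggplot call to work, the registered table first has to be collected into R; a sketch, assuming the spark_conn connection and the jfk2sfo table registered above (with air_time_hours computed as in the piping example):
library(dplyr)
library(ggplot2)
jfk2sfo <- spark_conn %>% tbl("jfk2sfo") %>% collect()
ggplot(jfk2sfo, aes(month, air_time_hours, group = month)) + geom_boxplot()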
Sparklyr – Example B
Iris dataset: measures attributes for 150 flowers in 3 different species of iris
library(sparklyr)
library(ggplot2)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
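One simple check on the Spark copy of the data (note that copy_to() rewrites the column names, e.g. Petal.Length becomes Petal_Length):
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE))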
Sparklyr – Example B(II)
https://spark.rstudio.com/images/homepage/sparklyr.pdf
Install sparklyr
Install Java 8
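A typical installation sequence from R (a sketch; the Spark version to install depends on your setup):
install.packages("sparklyr")
library(sparklyr)
spark_install()   # downloads and installs a local copy of Spark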
Hadoop Wiki
Introduction
http://wiki.apache.org/lucene-hadoop/
Getting Started
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
Map/Reduce Overview
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
References
“Learning Spark” Lightning-Fast Big Data Analysis. Holden Karau,
Andy Konwinski, Patrick Wendell, Matei Zaharia, O’Reilly Media, 2015
“Mastering Spark with R”, Javier Luraschi, Kevin Kuo, Edgar Ruiz, 2021
http://therinspark.com
Apache Spark Tutorial
https://www.tutorialspoint.com/apache_spark/index.htm
Apache Spark References:
http://spark.apache.org/docs/latest/