
Master in Biomedical Engineering

Grupo de Bioingeniería
y Telemedicina

Big Data
SPARK and Sparklyr
Contents

 Review
 Apache Spark
 Hadoop vs Spark
 Apache Spark Components
 Programming model in Spark: RDD
 Sparklyr
Review: Hadoop

 Open-source implementation of MapReduce


 Hadoop Distributed File System (HDFS)
 Distributed file system that stores data providing high aggregated bandwidth
across the cluster
 YARN
 Resource management platform responsible for managing computing
resources in clusters and using them for scheduling the users’ applications
 MapReduce
 Programming model for large scale data processing
 Map: distributes the dataset among multiple servers and operates on the data
 Reduce: partial results are recombined
MapReduce: Example Word Counting
[Figure: word counting with MapReduce on a text excerpt about antihypertensive treatment and cardiovascular disease]
 Map (provided by the programmer): reads the input and produces a set of (key, value) pairs, e.g. (The, 1), (effect, 1), (of, 1), (antihypertensive, 1), …
 Group by key: collects all pairs with the same key, e.g. (cardiovascular, 1), (cardiovascular, 1), (SBP, 1), (SBP, 1), …
 Reduce (provided by the programmer): collects all the values belonging to each key and outputs the result, e.g. (cardiovascular, 2), (SBP, 2), (the, 2), (treatment, 3), …
 Only sequential reads of the big document are needed; each stage works on (key, value) pairs


HDFS: Data redundancy

 Failure of one Datanode


 Data is missing for the files that store a block on that node
 Solution
 Replicate each block three
times as it is stored in HDFS
 If a single node fails, there are
two other copies in the other
nodes
 The NameNode detects the
blocks that are under-
replicated and re-replicates
them on the cluster
Apache Spark (I)

 Open-source framework for cluster computing


 Developed in the AMPLab at UC Berkeley in 2009
 Open-sourced in 2010 and later donated to the Apache Software Foundation
 Apache Spark emerged as a fast and convenient solution to
perform complex analysis of large-scale data
 Reduces the complexity of interacting with data
 Processing is done in-memory
Up to 100x faster than disk-based processing for some workloads
 High flexibility -> supported on different platforms
 Provides high-level APIs for Java, Scala, Python and R
Apache Spark (II)

 Spark is not a modified version of Hadoop and does not depend on Hadoop


 It has its own cluster management
 Spark can use Hadoop (HDFS) for storage
 Spark extends the MapReduce model to efficiently support more types of
computations:
 Interactive queries
 Stream processing
 Complete solution to analyze data from different sources
 Real-time or batch processing
 Supports various formats such as images, texts, graphs, etc.
Hadoop vs Spark
Hadoop vs Spark

Hadoop
Well suited to sequential data processing
Each step consists of a Map and a Reduce function with a single pass over the input
But, for applications involving several passes over the input:
 High cost of maintaining the MapReduce operations at each step
 The MapReduce workflow requires constant writing to and reading from disk (HDFS)
Slows the system down with the added time of these I/O operations
Difficult integration with third-party tools
Hadoop vs Spark

Apache Spark
 Runs on a setup similar to HDFS
 Achieves improved performance
 Provides added functionality
It shares data across in-memory data structures
It allows processing the same dataset in parallel
It includes utilities for data analysis and processing
Hadoop vs Spark
Hadoop data sharing
[Figures: iterative and interactive operations in Hadoop MapReduce, sharing data through stable storage (HDFS)]
 The only way to reuse data between computations (e.g. between two MapReduce jobs) is to write it to an external stable storage system such as HDFS
 Both iterative and interactive applications require faster data sharing across parallel jobs
 Most Hadoop applications spend more than 90% of the time doing HDFS read-write operations
Slow due to replication, serialization and disk I/O

Spark data sharing
[Figures: iterative and interactive operations in Spark, sharing data through distributed memory]
 In-memory computation with RDDs
 Stores the state of memory as an object across the jobs; the object is sharable between them
 Iterative operations: intermediate results are stored in distributed memory, which makes the system faster
 Interactive operations: when different queries are run repeatedly on the same set of data, the data can be kept in memory for better execution times
Data sharing in memory is 10 to 100 times faster than network and disk
Spark performance vs Hadoop

 Example: Logistic Regression


 High processing speed
 Spark enables applications in Hadoop clusters to run up to 100 times faster in
memory, and 10 times faster when running on disk

Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010).


“Spark: Cluster computing with working sets.” HotCloud, 10(10-10), 95
Sorting in Hadoop vs Spark

 Spark sorted 100TB of data on


disk in 23 minutes
 Better performance than
Hadoop on disk
 Apache Spark sorted the same
data 3X faster using 10X fewer
machines compared to
Hadoop
 First public cloud 1 Petabyte
sort
 Spark sets a record in large-
scale sorting

https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Spark Components
Apache Spark Components

http://spark.apache.org
Apache Spark Components

 Spark Streaming: facilitates real-time processing of streamed data in a micro-batch manner
 Spark SQL:
• Module for working with structured data
• Seamlessly mixes SQL queries with Spark programs, or access through JDBC and ODBC connectors
• Enables working with data in different formats, such as JSON, to perform a query
• Includes a cost-based optimizer, columnar storage and code generation to make queries fast
Apache Spark Components

 Spark MLlib: collection of common ML algorithm implementations and utilities to perform related operations
 Spark GraphX:
• Solution for graphs and computations involving graphs
• Comparable performance to the fastest specialized graph processing systems
Spark’s Machine Learning Toolkit (MLlib)

 MLlib and ML Pipelines: scalable, distributed ML libraries


 Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
 Persistence: saving and loading algorithms, models, and Pipelines
 Basic statistics: mean, median, correlation, hypothesis testing
 Features: feature extraction, selection, transformation, dimensionality
reduction
 Classification:
 SVM, Logistic Regression, Decision Trees, Naive Bayes, ...
 Regression: Linear, Lasso, Ridge, ...
 Clustering, and collaborative filtering
 Optimization
Spark Components

 Spark applications run as independent sets of processes on a cluster


 Driver program coordinates processes and schedules tasks on the cluster
 Worker programs run on cluster nodes or in local threads

http://spark.apache.org/docs/latest/cluster-overview.html
Spark Components

 Each application gets its own executor processes:


Benefits:
Isolates applications from each other:
On the scheduling side: each driver schedules its own tasks
On the executor side: tasks from different applications run in different JVMs
Drawbacks:
Data cannot be shared across different Spark applications
(instances of SparkContext) without writing it to an external storage
system
Spark Driver

Driver program:
SparkContext object
Coordinates the processes on a cluster
Allows connecting to several types of cluster managers, which
allocate resources across applications
Once connected, Spark acquires executors on nodes in the cluster
Executors are processes that run computations and store data
Next, the driver sends the application code to the executors
Finally, SparkContext sends tasks to the executors to run
Spark Workers

Worker
Any node that can run application code in the cluster or in local
threads

Executor
A process launched for an application on a worker node, that
runs tasks and keeps data in memory or disk storage
Each application has its own executors
Programming model in Spark: RDD
Resilient Distributed Datasets (RDD)
 Fundamental data structure of Spark
 Immutable distributed collection of objects that can be stored in
memory or disk
 An RDD is a read-only, partitioned collection of records
 It can be computed on different nodes of the cluster and operated on in
parallel
 When RDDs are created, a DAG (directed acyclic graph) is built
(transformation):
 Transformations update the graph, but nothing happens until actions are called

 Fault-tolerant collection of elements


DAG: Directed Acyclic Graph in Spark

 A set of Vertices and Edges


 Vertices represent the RDDs
 Edges represent the Operation to be applied on RDD
 In Spark DAG, every edge directs from earlier to later
in the sequence
 DAG operations allow better global optimization
than other systems like MapReduce
 The graph can be replayed on nodes that need to
get back to the state they were in before going offline
Resilient Distributed Datasets (RDD)
 3 ways to create RDDs:
Parallelizing an existing collection in the driver program
Data resides in Spark and can be operated on in parallel (parallelize method)
Referencing a dataset in an external storage system
E.g. Shared file system, HDFS, HBase, Cassandra, or any data source
supported by Hadoop
Transforming an existing RDD to create a new RDD
E.g. use filter method to select strings shorter than 20 characters
Resilient Distributed Datasets (RDD)
 The dataset is cut into a number of partitions
 Spark tries to set the number of partitions automatically based on the
cluster characteristics
Spark will run one task for each partition
 Spark uses the concept of RDD to achieve:
 Faster MapReduce operations
 More efficient MapReduce operations

 RDD Execution process:


1. Automatically split work into many small, idempotent tasks
2. Send tasks to nodes based on data locality
3. Load-balance dynamically as tasks finish
Example: create RDDs

 Launch Spark Shell, initialize SparkContext (sc)


./bin/spark-shell
 Create a dataset
val dataset = 1 to 10000
 Create the RDD: parallelize the data (returns a pointer to the RDD)
val distributedData = sc.parallelize(dataset)
 Perform additional transformations or invoke an action on it
 E.g. select values < 10
distributedData.filter(_ < 10).collect()

sc points to the SparkContext; it is created internally by Spark


RDD basic operations
 Create a RDD from an external dataset
 Load the file: creates a RDD, which is a pointer to the file
 The dataset is not loaded into memory yet
val line = sc.textFile("hdfs://data.txt")
val lines2 = sc.textFile("Readme.md")

 Apply transformation - > Updates the DAG


 Map each line to its length
val lineLengths = line.map(s => s.length)

 Invoke an action:
 Get the total length of all the lines-> Applies transformation + Action
 Return value to the caller
val totalLengths = lineLengths.reduce((a, b) => a + b)
RDD basic operations

 Example MapReduce WordCount


Split the file into words
val textFile = sc.textFile("Readme.md")
val wordCounts = textFile.flatMap(line => line.split(" "))

Map each word into a (key, value) pair with (k, v) = (word, 1)

.map(word => (word, 1)) TRANSFORMATION

Reduce by the key, adding up all the values of the same key
.reduceByKey((a, b) => a + b)

Print out all the words and their occurrences

wordCounts.collect() ACTION
Example Log Mining
 Aim: Search for various patterns interactively
 Load error messages from a log file into memory
var lines = spark.textFile("hdfs://…")

 Transformations - will be done when invoking the action


Filter log errors
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))
Cache the filtered dataset
messages.cache()

 Actions: count specific error messages


1) messages.filter(_.contains("mysql")).count()
2) messages.filter(_.contains("php")).count()
Example Log Mining
[Figure, repeated for each step below: the driver program and three workers; the input file is split into Block 1, Block 2 and Block 3 across the cluster, and after caching each worker also holds an in-memory partition (Cache 1, Cache 2, Cache 3)]

 What happens when an action is executed?
1. var lines = spark.textFile("hdfs://…"): the data is partitioned into different blocks across the cluster
2. The driver sends the code to be executed on each block; the executor in each worker performs the work on its block
3. The executors read their HDFS blocks to prepare the data for operations in parallel
4. The transformations are applied (filter the ERROR lines, split each line by tab); messages.cache() creates a cache in memory and messages keeps a pointer to the cached data
5. The first action, messages.filter(_.contains("mysql")).count(), is completed and the results are sent back to the driver
6. The second action, messages.filter(_.contains("php")).count(), uses the data already in the cache
 Caching provides faster results: for 1 TB of log data, ~5-7 s from cache vs. ~170 s on disk
RDD operations

Transformations: Actions:
Map Reduce
Filter Collect
GroupByKey Count
ReduceByKey Save
Sample LookupKey
Union ….
Join
Cache
….
RDD operations - transformations
 Transformations are lazy evaluations: Results are not computed right away
Transformation: Meaning (https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: using reduceByKey or aggregateByKey will yield much better performance.
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported with leftOuterJoin and rightOuterJoin.
RDD operations - actions
 Actions return values
Action: Meaning (http://spark.apache.org/docs/latest/programming-guide.html#transformations)

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count(): Return the number of elements in the dataset.
first(): Return the first element of the dataset (similar to take(1)).
take(n): Return an array with the first n elements of the dataset.
takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system.
foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
SparkContext

 Main entry point for Spark functionality


 Represents the connection to a Spark cluster
 It can be used to:
 Create RDDs
 Create shared variables on the cluster
 In the Spark shell
 SparkContext is automatically initialized
 In a Spark program, a new SparkContext must be created
 Only one SparkContext may be active
 Any active SparkContext must be stopped before creating a new one
Sparklyr
Introduction
 Provides a light-weight front-end
to use Apache Spark from R
 The R package sparklyr allows
working with data in a Spark
cluster
 It has a dplyr interface
Similar syntax to dplyr-style R code
 Sparklyr supports distributed
machine learning using MLlib

https://spark.rstudio.com
R vs Sparklyr

 R is used for data sources that are not large scale


 Spark – parallel computation
 For data that does not fit into memory
 When computation is too slow

https://therinspark.com/analysis.html
R vs Sparklyr

 R is restricted to a single thread and the memory of a single machine


 Sparklyr is multi-threaded across many machines and has access to
the memory of all the machines in the cluster
 Example: a large amount of initial data that needs cleaning,
filtering and aggregating, plus modeling
 1) Preprocessing in R is nearly impossible – memory constraints
 2) Model training
 In R, performance is constrained by doing the work on a single machine
 Sparklyr performs the parameter tuning in parallel and aggregates the
results to find the best model for the dataset (see the sketch below)
 Sparklyr improves performance by several orders of magnitude
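
A minimal sketch of parallel hyperparameter tuning with sparklyr's ML pipeline and cross-validator functions (the dataset, formula and parameter grid below are illustrative assumptions, not taken from the slides):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Pipeline: encode the model formula, then a logistic regression estimator
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(am ~ mpg + wt + hp) %>%
  ml_logistic_regression()

# Hyperparameter grid explored in parallel on the cluster
grid <- list(
  logistic_regression = list(
    elastic_net_param = c(0, 0.5, 1),
    reg_param = c(0.01, 0.1)
  )
)

cv <- ml_cross_validator(
  sc,
  estimator = pipeline,
  estimator_param_maps = grid,
  evaluator = ml_binary_classification_evaluator(sc),
  num_folds = 3
)

cv_model <- ml_fit(cv, mtcars_tbl)
ml_validation_metrics(cv_model)   # metric for each parameter combination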
Sparklyr -Distributed DataFrames

 DataFrames are a fundamental data structure used for data


processing in R
 Sparklyr provides a Distributed DataFrames implementation that
supports R operations such as selection, filtering and aggregation on large
datasets
 Conceptually equivalent to a table in a relational database: organized
into named columns
 They can be constructed from different data sources
Structured data files, tables in Hive, external databases, existing local R data
frames, CSV files, JSON files,…
 Sparklyr DataFrames present an API similar to package dplyr
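
A minimal sketch of constructing Spark DataFrames from different sources with sparklyr (the file paths and table names are hypothetical):

library(sparklyr)

sc <- spark_connect(master = "local")

# From an existing local R data frame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# From structured data files (hypothetical paths; the files must exist)
csv_tbl  <- spark_read_csv(sc, name = "patients", path = "data/patients.csv")
json_tbl <- spark_read_json(sc, name = "events", path = "data/events.json")

# Each DataFrame is organized into named columns, like a relational table
colnames(mtcars_tbl)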
SparkContext

 The SparkContext (sc) is the entry point into Sparklyr from the R environment


 It gives access to all Spark functionality
 The Spark driver program uses the SparkContext to access the cluster through a
cluster manager (e.g. YARN or the Spark standalone cluster manager)
 It represents the client connection to a Spark execution environment

http://spark.apache.org/docs/latest/cluster-overview.html
Spark Context in sparklyr
 The entry point into all relational functionality in Spark is the SQLContext:
 SparkSQL is a Spark module for structured data processing
 Sparklyr works solely with DataFrames, which require SparkSQL
 It translates dplyr functions and vector expressions into SQL
 Example:
 Load the package:
library(sparklyr)
 Connect to Spark:
spark_conn <- spark_connect("local")
 Disconnect from Spark:
spark_disconnect(spark_conn)
 *To see how the requested configuration affected the Spark connection, go to the
Executors page in the Spark Web UI available in http://localhost:4040/
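
A minimal sketch of how dplyr verbs are translated into Spark SQL, and of running the equivalent raw SQL query through sparklyr's DBI interface (using the built-in iris data as an example):

library(sparklyr)
library(dplyr)

spark_conn <- spark_connect(master = "local")
iris_tbl <- copy_to(spark_conn, iris, "iris", overwrite = TRUE)

# dplyr verbs are translated into Spark SQL behind the scenes
query <- iris_tbl %>%
  filter(Petal_Width > 1.5) %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE))

show_query(query)   # inspect the generated SQL statement

# The same result through a raw SQL query on the registered table
DBI::dbGetQuery(spark_conn,
  "SELECT Species, AVG(Petal_Length) AS mean_petal_length
     FROM iris
    WHERE Petal_Width > 1.5
    GROUP BY Species")

spark_disconnect(spark_conn)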
Data manipulation in Sparklyr: DataFrames

 Use the distributed parallel capabilities offered by RDDs


 They import a schema on top of the data giving structure
 Metadata, column names, data types, etc.
 Spark DataFrame based on column access, like in R
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 <NA> Fr 840
Sue 24 66 1118 Cauc Sr NA
Jeff 20 72 202 Asian So 880
Data Manipulation based on package dplyr (I)

 When connected to a Spark DataFrame, dplyr translates the


commands into Spark SQL statements
 Examples of dplyr verbs with their corresponding SQL commands:
 select ~ SELECT
 filter ~ WHERE
 arrange ~ ORDER by
 mutate ~ operators: +, *, log, etc.
 summarise ~ aggregators: sum, min, sd, etc.
Data Manipulation based on package dplyr (II)

 Filter data:
filter(flights, month == 1, day == 1)
 Arrange rows (re-order rows):
arrange(flights, year, month, day)
 Select columns by name
select(flights, year, month, day)
 Extract distinct (unique) rows:
distinct(flights, tailnum)
 Add new columns
mutate(flights, speed = distance / air_time * 60)
 Summarize values to a single row:
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
Data Manipulation based on package dplyr (III)

 group_by()
 It breaks down a dataset into specified groups of rows
 When a function is applied on the resulting object, it is automatically applied “by group”
 arrange() orders first by the grouping variables
 sample_n() and sample_frac() sample the specified number/fraction of rows in
each group
sample_n(flights, 10)
 slice() extracts rows within each group
 summarize() is used with aggregate functions (min, max, sum, mean, sd, median, …)
 These take a vector of values and return a single number
summarize(per_day, flights = sum(flights))
Use of piping with dplyr

 Allows writing cleaner syntax using the magrittr library


 2 equivalent code options:
A)
c1 <- filter(flights, day == 17, month == 5, carrier %in% c('UA', 'WN',
'AA', 'DL'))
c2 <- select(c1, carrier, dep_delay, air_time, distance)
c3 <- arrange(c2, carrier)
c4 <- mutate(c3, air_time_hours = air_time / 60)

B)
c4 <- flights %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
select(carrier, dep_delay, air_time, distance) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)
Copy data into Spark

 Read CSV files


sp_flights <- spark_read_csv(sc,
name = "flights",
path = "data",
memory = FALSE,
columns = file_columns,
infer_schema = FALSE)
 Copy data from a dataset in R
 Slow process when using big datasets

iris_tbl <- copy_to(spark_conn, iris, overwrite = TRUE)

 See the list of all data frames stored in Spark:

src_tbls(sc)
Machine Learning in Sparklyr
 MLlib is Spark’s machine learning library
 In Sparklyr it is provided by the spark.ml package
 Machine Learning workflows can be created and tuned together
with sparklyr’s dplyr interface
 Sparklyr provides three families of functions:
 Machine learning algorithms for analyzing data (ml_*)
 Feature transformers for manipulating individual features (ft_*)
 Functions for manipulating Spark DataFrames (sdf_*)
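
A minimal sketch combining the three families on a small built-in dataset (the columns, threshold and split proportions are illustrative assumptions):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# sdf_*: split the Spark DataFrame into training and test sets
partitions <- mtcars_tbl %>%
  sdf_random_split(training = 0.75, test = 0.25, seed = 42)

# ft_*: derive a binary feature from a continuous column (arbitrary threshold)
training <- partitions$training %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 100)

# ml_*: fit a linear regression predicting fuel consumption
fit <- training %>%
  ml_linear_regression(mpg ~ wt + cyl + high_hp)
summary(fit)

# Apply the model to the held-out partition
test <- partitions$test %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 100)
ml_predict(fit, test) %>% select(mpg, prediction) %>% head()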

https://spark.rstudio.com/mlib/
Sparklyr – ML functions
Sparklyr - Transformers

 A model is often fit on some transformation of a dataset


 Feature transformers facilitate many common transformations of
data within a Spark DataFrame
 Sparklyr exposes these within the ft_* family of functions
 E.g. Take one or more input columns, and generate a new output column
formed as a transformation of those columns
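
A minimal sketch of a feature transformer on the iris data (note that sparklyr replaces dots in column names with underscores when copying; the bucket boundaries are arbitrary):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris_ft", overwrite = TRUE)

# ft_bucketizer: takes one input column and generates a new output column
# assigning each value to a bucket defined by the split points
iris_tbl %>%
  ft_bucketizer(
    input_col  = "Sepal_Length",
    output_col = "Sepal_Length_Bucket",
    splits     = c(4, 5, 6, 7, 8)
  ) %>%
  select(Sepal_Length, Sepal_Length_Bucket) %>%
  head(5)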
Sparklyr - Transformers

https://spark.rstudio.com/packages/sparklyr/latest/reference/#spark-feature-transformers
Sparklyr - Utilities

 Functions for interacting with Spark ML model fits

https://spark.rstudio.com/packages/sparklyr/latest/reference/#spark-ml---evaluation
Example A – Data Manipulation based on dplyr

 Data from the nycflights13 R package.


 Contains data for all 336,776 flights departing New York City in 2013; metadata on airlines, airports,
weather, and planes
 Data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13
 Connect to the cluster and copy the flights data using the copy_to function:
library(sparklyr)
library(nycflights13)
library(dplyr)
sc <- spark_connect(master = "local")
flights <- copy_to(sc, nycflights13::flights, "flights")
airlines <- copy_to(sc, nycflights13::airlines, "airlines")
weather <- copy_to(sc, nycflights13::weather, "weather")
planes <- copy_to(sc, nycflights13::planes, "planes")
airports <- copy_to(sc, nycflights13::airports, "airports")
* Remember that large datasets should not be copied directly from R objects
Example A – Data Manipulation based on dplyr

 Combining commands equivalent to SQL commands:


sql<- flights %>%
select(month, day, carrier, air_time) %>%
filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
arrange(carrier) %>%
mutate(air_time_hours = air_time / 60)

#Check Equivalent SQL


dbplyr::sql_render(sql)
Example A – Data Manipulation based on dplyr

 The group_by function corresponds to the GROUP BY statement in


SQL:
flights %>%
group_by(month) %>%
summarize(count = n(), mean_dep_delay = mean(dep_delay), sd_dep_delay =
sd(dep_delay)) %>%
arrange(month)
Example A – Data Manipulation based on dplyr

 dplyr supports Spark SQL window functions in conjunction with


mutate and filter to solve a wide range of problems:
# Rank the top three largest delays for each carrier
flights %>%
group_by(carrier) %>%
mutate(rank = rank(desc(dep_delay))) %>%
filter(rank <= 3) %>%
select(carrier, year, month, day, dep_delay, rank)
Example A – Data Manipulation based on dplyr

 Performing Joins between several tables. 3 families of verbs that


work with two tables at a time:
 Mutating joins add new variables to one table from matching rows in another
 Filtering joins filter observations from one table based on whether or not they
match an observation in the other table
 Set operations, which combine the observations in the data sets as if they were
set elements
flights %>%
left_join(airlines, by = "carrier") %>%
select(name, carrier, flight, origin, dest,
dep_delay) %>%
filter(origin == "JFK", dest == "SFO") %>%
arrange(desc(dep_delay))
Example A – Data Manipulation based on dplyr
 Laziness: When working with databases, dplyr tries to be as lazy as possible:
 It never pulls data into R unless you explicitly ask for it
 It delays doing any work until the last possible moment: it collects together everything you
want to do and then sends it to the database in one step

 Register and Cache the data


 You can register tables with the Hive metadata store:

sdf_register(sql2, "jfk2sfo")
 Collecting to R
 You can copy data from Spark into R’s memory by using collect()
 collect() executes the Spark query and returns the results to R for further analysis and
visualization
 For example, to plot the data from the Hive data store:
ggplot(jfk2sfo, aes(month, air_time_hours, group=month)) + geom_boxplot()
Sparklyr – Example B

 Iris dataset: measures attributes for 150 flowers in 3 different species of iris

library(sparklyr)
library(ggplot2)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
Sparklyr – Example B(II)

 K-means clustering to partition a dataset into 3 groups using 2 out of 4 features

kmeans_model <- ml_kmeans(x = iris_tbl, k = 3, features =


c("Petal_Length", "Petal_Width"))

# print the model fit


print(kmeans_model)
Sparklyr – Example B (III)

 Apply the kmeans model to predict the associated class

# predict the associated class


predicted <- ml_predict(kmeans_model, iris_tbl) %>%
collect
table(predicted$Species, predicted$prediction)
Sparklyr – Example B(IV)
Sparklyr Cheatsheet

https://spark.rstudio.com/images/homepage/sparklyr.pdf
Install sparklyr

 Install Java 8

 Select ‘New Connection’
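
A minimal sketch of the setup from the R console (package and Spark versions are left at their defaults; this is an assumption, not taken from the slides):

# Install the sparklyr package from CRAN
install.packages("sparklyr")

library(sparklyr)

# Download and install a local copy of Spark (requires Java 8)
spark_install()

# Verify the installation by opening and closing a local connection
sc <- spark_connect(master = "local")
spark_version(sc)
spark_disconnect(sc)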


Tutorials - Sparklyr

 Example Sparklyr in Moodle

 Spark Machine Learning Library (MLlib)


https://spark.rstudio.com/mlib/
https://spark.rstudio.com/guides/mlib.html#examples

 Text mining with Spark & Sparklyr


https://spark.rstudio.com/guides/textmining/#data-import

 Understanding Spark Caching in sparklyr


https://spark.rstudio.com/guides/caching/
References (1)

 “Big data now” Current perspectives from O’Reilly Media, 2012


 “Big data essentials”. Anil Maheshwari, 2016
 “Big Data Application in Biomedical Research and Health
Care: A Literature Review”, Jake Luo, Min Wu, Deepika
Gopukumar, and Yiqing Zhao Biomed Inform. Insights, 2016 (8),
1-10 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4720168/
 “Mining of Massive Datasets”. J. Leskovec, A. Rajaraman, J.
Ullman, Cambridge University Press 2014
 “Hadoop: The definitive guide”. Tom White. Ed. O’Reilly Media,
2015
References (2)

Hadoop Wiki
 Introduction
 http://wiki.apache.org/lucene-hadoop/
 Getting Started
 http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
 Map/Reduce Overview
 http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
 http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
References (3)
 “Learning Spark” Lightning-Fast Big Data Analysis. Holden Karau,
Andy Konwinski, Patrick Wendell, Matei Zaharia, O’Reilly Media, 2015
 “Mastering Spark with R”, Javier Luraschi, Kevin Kuo, Edgar Ruiz, 2021
 http://therinspark.com
 Apache Spark Tutorial
 https://www.tutorialspoint.com/apache_spark/index.htm
 Apache Spark References:
 http://spark.apache.org/docs/latest/
