
Data Science in Spark with sparklyr :: CHEAT SHEET

Intro

sparklyr is an R interface for Apache Spark™. It provides a complete dplyr back end and the option to query directly using Spark SQL statements. With sparklyr, you can orchestrate distributed machine learning using either Spark's MLlib or H2O Sparkling Water.

Data Science Toolchain with Spark + sparklyr

• Import: export an R DataFrame, read a file, or read an existing Hive table
• Tidy / Wrangle: dplyr verbs, direct Spark SQL (DBI), or SDF functions (Scala API)
• Understand:
  • Transform: transformer functions
  • Visualize: collect data into R for plotting
  • Model: Spark MLlib or the H2O extension
• Communicate: collect data into R; share plots, documents, and apps
(Workflow from R for Data Science, Grolemund & Wickham)

RStudio Integrates with sparklyr

Starting with version 1.044, RStudio Desktop, Server and Pro include integrated support for the sparklyr package. You can create and manage connections to Spark clusters and local Spark instances from inside the IDE: open the connection log, open the Spark UI, browse Spark & Hive tables, preview the first 1K rows of a table, and disconnect.
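A minimal sketch of the dplyr back end in action: verbs run lazily against Spark and are translated to SQL, which dplyr::show_query() will print. It assumes a live connection sc like the one created under Getting Started below; the table name is illustrative.

library(sparklyr); library(dplyr)
iris_tbl <- copy_to(sc, iris, "iris_spark")  # copy_to renames columns, e.g. Petal.Length -> Petal_Length
iris_tbl %>%
  filter(Petal_Length > 2) %>%
  group_by(Species) %>%
  summarise(avg_width = mean(Petal_Width)) %>%
  show_query()                               # prints the Spark SQL sent to the cluster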
Getting Started

LOCAL MODE (No cluster required)
1. Install a local version of Spark:
   spark_install("2.0.1")
2. Open a connection:
   sc <- spark_connect(master = "local")

ON A YARN MANAGED CLUSTER
1. Install RStudio Server or RStudio Pro on one of the existing nodes, preferably an edge node
2. Locate the path to the cluster's Spark Home directory; it is normally "/usr/lib/spark"
3. Open a connection:
   spark_connect(master = "yarn-client", version = "1.6.2",
     spark_home = [Cluster's Spark path])

ON A MESOS MANAGED CLUSTER
1. Install RStudio Server or RStudio Pro on one of the existing nodes
2. Locate the path to the cluster's Spark directory
3. Open a connection:
   spark_connect(master = "[mesos URL]", version = "1.6.2",
     spark_home = [Cluster's Spark path])

ON A SPARK STANDALONE CLUSTER
1. Install RStudio Server or RStudio Pro on one of the existing nodes or a server in the same LAN
2. Install a local version of Spark:
   spark_install(version = "2.0.1")
3. Open a connection:
   spark_connect(master = "spark://host:port", version = "2.0.1",
     spark_home = spark_home_dir())

USING LIVY (Experimental)
1. The Livy REST application should be running on the cluster
2. Connect to the cluster:
   sc <- spark_connect(method = "livy", master = "http://host:port")

Cluster Deployment

MANAGED CLUSTER: the driver node connects to a cluster manager (YARN or Mesos), which assigns work to the worker nodes.
STAND ALONE CLUSTER: the driver node connects directly to the worker nodes.
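Whichever mode you use, it can help to sanity-check the new connection before going further. The calls below are a sketch using sparklyr's own helpers and assume a connection sc from any of the modes above.

spark_version(sc)   # version of Spark the connection is running against
src_tbls(sc)        # tables currently visible to Spark
spark_web(sc)       # open the Spark web UI in a browser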
Tuning Spark

EXAMPLE CONFIGURATION
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
sc <- spark_connect(master = "yarn-client",
  config = config, version = "2.0.1")

IMPORTANT TUNING PARAMETERS (with defaults)
• spark.yarn.am.cores
• spark.yarn.am.memory (512m)
• spark.network.timeout (120s)
• spark.executor.memory (1g)
• spark.executor.cores (1)
• spark.executor.instances
• spark.executor.extraJavaOptions
• spark.executor.heartbeatInterval (10s)
• sparklyr.shell.executor-memory
• sparklyr.shell.driver-memory
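The sparklyr.shell.* entries are consumed by sparklyr itself when it launches Spark, but they are set the same way as Spark properties, through spark_config() before connecting. A short sketch (the sizes are illustrative):

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G"     # backticks: names contain hyphens
config$`sparklyr.shell.executor-memory` <- "2G"
sc <- spark_connect(master = "local", config = config)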


Using sparklyr

A brief example of a data analysis using Apache Spark, R and sparklyr in local mode:

library(sparklyr); library(dplyr); library(ggplot2); library(tidyr)
set.seed(100)

spark_install("2.0.1")                        # Install Spark locally
sc <- spark_connect(master = "local")         # Connect to the local version

import_iris <- copy_to(sc, iris, "spark_iris",
  overwrite = TRUE)                           # Copy data to Spark memory

partition_iris <- sdf_partition(              # Partition the data
  import_iris, training = 0.5, testing = 0.5)

sdf_register(partition_iris,                  # Create a Hive metadata entry
  c("spark_iris_training", "spark_iris_test"))  # for each partition

tidy_iris <- tbl(sc, "spark_iris_training") %>%
  select(Species, Petal_Length, Petal_Width)

model_iris <- tidy_iris %>%                   # Spark ML decision tree model
  ml_decision_tree(response = "Species",
    features = c("Petal_Length", "Petal_Width"))

test_iris <- tbl(sc, "spark_iris_test")       # Create a reference to the Spark table

pred_iris <- sdf_predict(                     # Bring predictions back into
  model_iris, test_iris) %>%                  # R memory for plotting
  collect()

pred_iris %>%
  inner_join(data.frame(prediction = 0:2,
    lab = model_iris$model.parameters$labels)) %>%
  ggplot(aes(Petal_Length, Petal_Width, col = lab)) +
  geom_point()

spark_disconnect(sc)                          # Disconnect

Import

COPY A DATA FRAME INTO SPARK
sdf_copy_to(sc, x, name, memory, repartition, overwrite)
  sdf_copy_to(sc, iris, "spark_iris")
DBI::dbWriteTable(conn, name, value)
  DBI::dbWriteTable(sc, "spark_iris", iris)

IMPORT INTO SPARK FROM A FILE
Arguments that apply to all functions:
sc, name, path, options = list(), repartition = 0, memory = TRUE, overwrite = TRUE
CSV      spark_read_csv(header = TRUE, columns = NULL, infer_schema = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON     spark_read_json()
PARQUET  spark_read_parquet()

FROM A TABLE IN HIVE
tbl_cache(sc, name, force = TRUE)
  Loads the table into memory:
  my_var <- tbl_cache(sc, name = "hive_iris")
dplyr::tbl(src, ...)
  Creates a reference to the table without loading it into memory:
  my_var <- dplyr::tbl(sc, name = "hive_iris")
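A sketch of reading a file from disk into Spark; the same pattern applies to all three formats (the path and table name here are hypothetical):

flights_tbl <- spark_read_csv(sc, name = "flights_spark",
  path = "data/flights.csv", header = TRUE, infer_schema = TRUE)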
Wrangle

SPARK SQL VIA DPLYR VERBS
Verbs translate into Spark SQL statements, for example:
my_table <- my_var %>%
  filter(Species == "setosa") %>%
  sample_n(10)

DIRECT SPARK SQL COMMANDS
DBI::dbGetQuery(conn, statement)
my_table <- DBI::dbGetQuery(sc,
  "SELECT * FROM iris LIMIT 10")
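Because both interfaces address the same Spark session, a table registered from a dplyr pipeline can be queried over DBI afterwards. A sketch (the table names are illustrative):

my_var %>%
  filter(Species == "setosa") %>%
  sdf_register("setosa_only")            # register the result as a temp table
DBI::dbGetQuery(sc, "SELECT COUNT(*) AS n FROM setosa_only")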

SCALA API VIA SDF FUNCTIONS
sdf_mutate(.data)
  Works like the dplyr mutate function
sdf_partition(x, ..., weights = NULL, seed = sample(.Machine$integer.max, 1))
  e.g. sdf_partition(x, training = 0.5, test = 0.5)
sdf_register(x, name = NULL)
  Gives a Spark DataFrame a table name
sdf_sample(x, fraction = 1, replacement = TRUE, seed = NULL)
sdf_sort(x, columns)
  Sorts by one or more columns in ascending order
sdf_with_unique_id(x, id = "id")
sdf_predict(object, newdata)
  Spark DataFrame with predicted values

ML TRANSFORMERS
Arguments that apply to all functions: x, input.col = NULL, output.col = NULL
e.g. ft_binarizer(my_table, input.col = "Petal_Length", output.col = "petal_large", threshold = 1.2)
ft_binarizer(threshold = 0.5)
  Assigns values based on a threshold
ft_bucketizer(splits)
  Numeric column to discretized column
ft_discrete_cosine_transform(inverse = FALSE)
  Time domain to frequency domain
ft_elementwise_product(scaling.col)
  Element-wise product between 2 cols
ft_index_to_string()
  Index labels back to labels as strings
ft_one_hot_encoder()
  Continuous to binary vectors
ft_quantile_discretizer(n.buckets = 5L)
  Continuous to binned categorical values
ft_sql_transformer(sql)
ft_string_indexer(params = NULL)
  Column of labels into a column of label indices
ft_vector_assembler()
  Combines vectors into a single row-vector
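Transformers compose with sdf_mutate() much as mutate() does in dplyr. A sketch, assuming an iris_tbl Spark table like the one above:

iris_tbl %>%
  sdf_mutate(petal_large = ft_binarizer(Petal_Length, threshold = 1.2)) %>%
  head()    # petal_large is 0/1 depending on the threshold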
Visualize & Communicate

DOWNLOAD DATA TO R MEMORY
dplyr::collect(x)
  Downloads a Spark DataFrame to an R DataFrame:
  r_table <- collect(my_table)
  plot(Petal_Width ~ Petal_Length, data = r_table)
sdf_read_column(x, column)
  Returns the contents of a single column to R

SAVE FROM SPARK TO FILE SYSTEM
Arguments that apply to all functions: x, path
CSV      spark_write_csv(header = TRUE, delimiter = ",", quote = "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON     spark_write_json(mode = NULL)
PARQUET  spark_write_parquet(mode = NULL)

Reading & Writing from Apache Spark
• R into Spark memory: sdf_copy_to, dplyr::copy_to, DBI::dbWriteTable
• Reference or cache a Spark table: dplyr::tbl, tbl_cache
• Spark back into R memory: sdf_collect, dplyr::collect, sdf_read_column
• Between Spark and the file system: spark_read_<fmt>, spark_write_<fmt>
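A sketch of the save functions above in use, persisting a wrangled Spark table back to disk (the output path is hypothetical):

spark_write_parquet(my_table, path = "output/iris_setosa.parquet")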
Extensions

Create an R package that calls the full Spark API and provides interfaces to Spark packages.

CORE TYPES
spark_connection()
  Connection between R and the Spark shell process
spark_jobj()
  Instance of a remote Spark object
spark_dataframe()
  Instance of a remote Spark DataFrame object

CALL SPARK FROM R
invoke()
  Call a method on a Java object
invoke_new()
  Create a new object by invoking a constructor
invoke_static()
  Call a static method on an object
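A sketch of the invoke API: count the lines of a text file by calling the underlying Java methods directly (the path is hypothetical):

count_lines <- function(sc, path) {
  spark_context(sc) %>%                 # JVM handle to the SparkContext
    invoke("textFile", path, 1L) %>%    # RDD of lines
    invoke("count")                     # number of lines
}
count_lines(sc, "data/flights.csv")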
Model (MLlib)

e.g. ml_decision_tree(my_table, response = "Species", features = c("Petal_Length", "Petal_Width"))

ml_als_factorization(x, user.column = "user", rating.column = "rating", item.column = "item", rank = 10L, regularization.parameter = 0.1, iter.max = 10L, ml.options = ml_options())
ml_decision_tree(x, response, features, max.bins = 32L, max.depth = 5L, type = c("auto", "regression", "classification"), ml.options = ml_options())
  Same options for: ml_gradient_boosted_trees
ml_generalized_linear_regression(x, response, features, intercept = TRUE, family = gaussian(link = "identity"), iter.max = 100L, ml.options = ml_options())
ml_kmeans(x, centers, iter.max = 100, features = dplyr::tbl_vars(x), compute.cost = TRUE, tolerance = 1e-04, ml.options = ml_options())
ml_lda(x, features = dplyr::tbl_vars(x), k = length(features), alpha = (50/k) + 1, beta = 0.1 + 1, ml.options = ml_options())
ml_linear_regression(x, response, features, intercept = TRUE, alpha = 0, lambda = 0, iter.max = 100L, ml.options = ml_options())
  Same options for: ml_logistic_regression
ml_multilayer_perceptron(x, response, features, layers, iter.max = 100, seed = sample(.Machine$integer.max, 1), ml.options = ml_options())
ml_naive_bayes(x, response, features, lambda = 0, ml.options = ml_options())
ml_one_vs_rest(x, classifier, response, features, ml.options = ml_options())
ml_pca(x, features = dplyr::tbl_vars(x), ml.options = ml_options())
ml_random_forest(x, response, features, max.bins = 32L, max.depth = 5L, num.trees = 20L, type = c("auto", "regression", "classification"), ml.options = ml_options())
ml_survival_regression(x, response, features, intercept = TRUE, censor = "censor", iter.max = 100L, ml.options = ml_options())
ml_binary_classification_eval(predicted_tbl_spark, label, score, metric = "areaUnderROC")
ml_classification_eval(predicted_tbl_spark, label, predicted_lbl, metric = "f1")
ml_tree_feature_importance(sc, model)
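A sketch of fitting one of the models above, k-means on two iris features; iris_tbl and the $centers element follow sparklyr 0.5's model objects, so treat this as illustrative:

kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3)
kmeans_model$centers    # fitted cluster centers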
MACHINE LEARNING EXTENSIONS
ml_options()
ml_create_dummy_variables()
ml_model()
ml_prepare_dataframe()
ml_prepare_response_features_intercept()
RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more at spark.rstudio.com • sparklyr 0.5 • Updated: 2016-12
