
Machine Learning with Spark
Jesús Montes
jesus.montes@upm.es

Nov. 2022
Machine Learning with Spark
What is Machine Learning? (just in case…)
● Machine Learning
○ Related to/part of the AI field and Data Science
○ Providing machines with new, advanced capabilities
● Examples:
○ Database analysis
○ Autonomous driving
○ Handwriting recognition
○ Natural language processing
○ Product recommendation
● Related to the study and analysis of the human learning process



What is Machine Learning? (just in case…)
● The subfield of computer science that "gives computers the ability to
learn without being explicitly programmed" (Arthur Samuel, 1959).
● A more formal definition: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves
with experience E." (Tom M. Mitchell, 1998).
○ Experience (E)
○ Task (T)
○ Performance (P)
● Can you identify E, T and P in a typical automated SPAM filter problem?



Typical input data in Machine Learning

    Feature 1   ...   Feature m   Target
    x11         ...   x1m         y1
    ...         ...   ...         ...
    xn1         ...   xnm         yn

● Columns "Feature 1" to "Feature m" are the features; the "Target" (label) column is optional.
● Each row (xi1, ..., xim) is a feature vector; the xij and yi entries are the values.


Machine Learning techniques
● Supervised learning
● Unsupervised learning
● Other:
○ Semi-supervised learning
○ Reinforcement learning
○ Recommender systems
○ ...



Supervised Learning

Extract knowledge from:

● Initial data
● Target variable. True values for this variable are known for the initial data.

Once the learning process takes place, the system will be able to predict the expected value of the target variable (more or less accurately).

Types of supervised learning:

● Regression
○ Continuous target variable
○ Examples: stock market values, temperature prediction...
○ Techniques: Generalized Linear Models, Artificial Neural Networks...
● (Supervised) Classification
○ Discrete/categorical target variable
○ Examples: disease diagnosis, error detection in production systems...
○ Techniques: Logistic Regression, Decision Trees, Naïve Bayes...


Unsupervised Learning

Extract knowledge from:

● Initial data
● No target variable

The learning result is new knowledge about the organization of the information contained in the initial data.

The most typical case is clustering (a.k.a. unsupervised classification).

Typical techniques: K-Means, Expectation-Maximization, hierarchical clustering, density-based clustering...


Evaluating the learning process

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom M. Mitchell, 1998).

To evaluate T we need to define the performance measure P. Ideally, P should be a numerical measurement that allows objective evaluation of the learning process.

Typical performance measures:

Regression:

● Mean squared error
● Coefficient of determination (R²)

Classification:

● Correctly classified percentage (accuracy)
● True/false positives and negatives
● Precision and recall

Clustering:

● Level of agreement (Rand index...)
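
For reference, the two regression measures have standard formulations (these formulas are standard definitions, not taken from the original slides). For n data points with true values y_i, predictions ŷ_i and mean ȳ:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}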


Learning and evaluation

[Figure: three plots of Price vs. Surface, showing a model that is too simple, a good fit, and a model that is too complex]

● High bias: the model is too simple; high prediction error (underfitting).
● OK: prediction within acceptable margins.
● High variance: the model is too complex; high prediction error (overfitting).


Learning and evaluation

Typically, the performance measure is also used during the learning process, to guide it.

● At the end of the learning process, we obtain a performance measurement for the data used during the training (the training dataset).

WARNING: overfitting

● The model created might only be able to correctly predict the values it has already "seen" (the training dataset).

To detect this, we need (at least) two datasets with the same structure:

● Training set: used to train the model.
● Test set: used to evaluate the model.

The test set is only used at the end, and the resulting value of the performance measure is the final performance of our model. The test set must never be used for training.
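
In Spark, this split is usually done with the DataFrame method randomSplit. A minimal sketch (the DataFrame df and the seed value are placeholders):

// Randomly split df: roughly 70% for training, 30% for test.
// The seed makes the split reproducible across runs.
val Array(trainingSet, testSet) = df.randomSplit(Array(0.7, 0.3), seed = 42L)

The same idea appears in the Pipeline example at the end of this deck.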


Machine Learning with Spark
Data Science and Big Data are strongly related fields.

So far, we have learned how Spark can be used to load and process data, but
Spark can be used for many steps of a Data Science project:

● Data load
● Processing/cleaning
● Model training
● Model evaluation

The Spark component that provides Machine Learning tools is called MLlib.



Machine Learning in the Spark Stack

[Figure: MLlib's position within the Spark stack]


MLlib (Machine Learning Library)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical
machine learning scalable and easy. At a high level, it provides tools such as:

● ML Algorithms: common learning algorithms such as classification,


regression, clustering, and collaborative filtering.
● Featurization: feature extraction, transformation, dimensionality
reduction, and selection.
● Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.
● Persistence: saving and loading algorithms, models, and Pipelines.
● Utilities: linear algebra, statistics, data handling, etc.



MLlib API
● All tools in MLlib can be accessed through the MLlib API
○ Available in Scala, Java and Python
○ Limited functionality in R, but growing
● The MLlib API is DataFrame-based
○ All algorithms, transformations and other operations take DataFrames as input and
(usually) produce DataFrames as output.
● An RDD-based MLlib API, from earlier versions of Spark, still exists
○ This API is currently in "maintenance mode": bugs are still being fixed, but no new functionality is added.
○ The main development focus is now the DataFrame-based API
○ The DataFrame-based API is the recommended one



Spark Applications using MLlib
Add the following dependencies (SBT):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0",
  "org.apache.spark" %% "spark-sql" % "3.3.0",
  "org.apache.spark" %% "spark-mllib" % "3.3.0"
)



MLlib API components
The MLlib API is based on five main components:

● DataFrame: This ML API uses DataFrame from Spark SQL.


● Transformer: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame.
● Estimator: An Estimator is an algorithm which can be fit on a DataFrame
to produce a Transformer.
● Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
● Parameter: All Transformers and Estimators now share a common API
for specifying parameters.
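
Parameters can be set through setter methods (as in the examples that follow) or passed at fit time through a ParamMap. A small sketch, assuming an Estimator lr (such as the LinearRegression defined a few slides ahead) and a training DataFrame:

import org.apache.spark.ml.param.ParamMap

// Override lr's parameters for this fit() call only,
// without touching the values set through the setters.
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.01)
val model = lr.fit(training, paramMap)
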
Transformer
● A Transformer is an abstraction that includes feature transformers and
learned models.
● A Transformer implements a method transform(), which converts one
DataFrame into another, generally by appending one or more columns.
● For example:
○ A feature transformer might take a DataFrame, read a column (e.g., text), map it into a
new column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.
○ A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted
labels appended as a column.



Examples of Transformers

The first example creates a vector of features from each row of the DataFrame. This vector can later be used for training a regression model.

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val df = Seq(
  (-3.0965012, 5.2371198, -0.7370271),
  (-0.2100299, -0.7810844, -1.3284768),
  (8.3525083, 5.3337562, 21.8897181),
  (-3.0380369, 6.5357180, 0.3469820),
  (5.9354651, 6.0223208, 17.9566144),
  (-6.8357707, 5.6629804, -8.1598308),
  (8.8919844, -2.5149762, 15.3622538),
  (6.3404984, 4.1778706, 16.7931822)
).toDF("x1", "x2", "y")

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val output = assembler.transform(df)
output.show(truncate = false)

The second example normalizes the vector of features.

import org.apache.spark.ml.feature.Normalizer

val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(output)
l1NormData.show(truncate = false)


Estimator
● An Estimator abstracts the concept of a learning algorithm or any
algorithm that fits or trains on data.
● An Estimator implements a method fit(), which accepts a DataFrame
and produces a Model, which is a Transformer.
● For example:
○ A learning algorithm such as LogisticRegression is an Estimator, and calling fit()
trains a LogisticRegressionModel , which is a Model and hence a Transformer.



Estimator example 1: LinearRegression

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("y")
  .setMaxIter(10)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)

println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")

val trainingSummary = lrModel.summary

println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
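
Since lrModel is a Transformer, it can also be applied to new data to obtain predictions. A short sketch reusing the output DataFrame from the VectorAssembler example (in a real project this would be a separate test set):

// transform() appends a "prediction" column with the model's estimates
lrModel.transform(output)
  .select("features", "y", "prediction")
  .show(truncate = false)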


Estimator example 2: K-Means

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val df2 = Seq(
  (1.60193653, -1.8679101),
  (-1.1328963, -1.9607465),
  (2.40675869, -1.5994823),
  (0.09330145, -2.9446696),
  (1.3795901, -2.5489864),
  (-0.42065496, -2.8165693),
  (0.55753398, -2.0145494),
  (1.3066549, -2.3208153),
  (0.66224722, -0.9406476),
  (1.19072851, -2.4178092),
  (4.67961769, 1.6375689),
  (5.03015133, 1.4575724),
  (6.1003413, 2.1673923),
  (4.20259176, 1.8237144),
  (4.93339445, 1.8983999),
  (6.70975052, 1.5899655),
  (5.01461979, 1.6051478),
  (5.00005277, 1.6855351),
  (4.12926186, 2.0582398)
).toDF("x1", "x2")

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val dataset = assembler.transform(df2)

val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val model = kmeans.fit(dataset)

println("Cluster Centers: ")
model.clusterCenters.foreach(println)
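
The fitted KMeansModel is a Transformer as well: transform() appends each point's cluster assignment. A hedged sketch of using it together with MLlib's ClusteringEvaluator (available since Spark 2.3) to compute the silhouette score:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Assign each point to a cluster (appends a "prediction" column)
val predictions = model.transform(dataset)

// Silhouette: values close to 1 indicate well-separated clusters
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette = $silhouette")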


Pipeline
● In machine learning, it is common to run a sequence of algorithms to
process and learn from data.
● Example: A simple text document processing workflow might include
several stages:
○ Split each document’s text into words.
○ Convert each document’s words into a numerical feature vector.
○ Learn a prediction model using the feature vectors and labels.
● MLlib represents such a workflow as a Pipeline, which consists of a
sequence of PipelineStages (Transformers and Estimators) to be run in a
specific order.



Pipeline
● A Pipeline is specified as a sequence of stages (Transformer or Estimator).
● Stages are run in order.
● The input DataFrame is transformed as it passes through each stage.
○ For Transformer stages, the transform() method is called on the DataFrame.
○ For Estimator stages, the fit() method is called to produce a Transformer, and that
Transformer’s transform() method is called on the DataFrame.



Pipeline
● A Pipeline is an Estimator.
● After a Pipeline’s fit() method runs, it produces a PipelineModel, which
is a Transformer.
● This PipelineModel is used at test time.



Pipeline Example

import org.apache.spark.ml.Pipeline

// Split the data into training (70%) and test (30%)
val split = df2.randomSplit(Array(0.7, 0.3))
val training = split(0)
val test = split(1)

// Create the pipeline stages
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)

// Create the pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, kmeans))

// Train and use the pipeline
val model = pipeline.fit(training)
model.transform(test).show(truncate = false)
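
The Persistence tools mentioned in the MLlib overview apply here too: a fitted PipelineModel can be saved to disk and reloaded later. A minimal sketch (the path is just an example):

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline (all stages included) to disk
model.write.overwrite().save("/tmp/kmeans-pipeline-model")

// Reload it later and use it exactly like the original model
val sameModel = PipelineModel.load("/tmp/kmeans-pipeline-model")
sameModel.transform(test).show(truncate = false)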


MLlib documentation
● The list of Transformers and Estimators available in MLlib grows with each
Spark release.
● Always keep the Spark Programming Guides at hand: they are extremely
valuable when learning new models and techniques.
● http://spark.apache.org/docs/latest/ml-guide.html
