
Machine Learning with Spark
Jesús Montes
jesus.montes@upm.es

Nov. 2022
Machine Learning with Spark
What is Machine Learning? (just in case…)
● Machine Learning
○ Related to/part of the AI field and Data Science
○ Providing machines with new, advanced capabilities
● Examples:
○ Database analysis
○ Autonomous driving
○ Handwriting recognition
○ Natural language processing
○ Product recommendation
● Related to the study and analysis of the human learning process



What is Machine Learning? (just in case…)
● The subfield of computer science that "gives computers the ability to
learn without being explicitly programmed" (Arthur Samuel, 1959).
● A more formal definition: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves
with experience E." (Tom M. Mitchell, 1998).
○ Experience (E)
○ Task (T)
○ Performance (P)
● Can you identify E, T and P in a typical automated SPAM filter problem?



Typical input data in Machine Learning

    Feature 1   ...   Feature m   Target
    x11         ...   x1m         y1
    ...         ...   ...         ...
    xn1         ...   xnm         yn

● Columns "Feature 1" to "Feature m" are the features; the "Target" (label) column is optional.
● Each row (xi1, ..., xim) is a feature vector; the xij and yi entries are the values.


Machine Learning techniques
● Supervised learning
● Unsupervised learning
● Other:
○ Semi-supervised learning
○ Reinforcement learning
○ Recommender systems
○ ...



Supervised Learning

Extract knowledge from:

● Initial data
● Target variable. True values for this variable are known for the initial data.

Once the learning process takes place, the system will be able to predict the expected value of the target variable (more or less accurately).

Types of supervised learning:

● Regression
○ Continuous target variable
○ Examples: stock market values, temperature prediction...
○ Techniques: Generalized Linear Models, Artificial Neural Networks...
● (Supervised) Classification
○ Discrete/categorical target variable
○ Examples: disease diagnosis, error detection in production systems...
○ Techniques: Logistic Regression, Decision Trees, Naïve Bayes...


Unsupervised Learning

Extract knowledge from:

● Initial data
● No target variable

The learning result is new knowledge about the organization of the information contained in the initial data.

The most typical case is clustering (a.k.a. unsupervised classification).

Typical techniques: K-Means, Expectation-Maximization, hierarchical clustering, density-based clustering...


Evaluating the learning process

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." (Tom M. Mitchell, 1998).

To evaluate T we need to define the performance measure P. Ideally, P should be a numerical measurement that allows objective evaluation of the learning process.

Typical performance measures:

Regression:

● Mean squared error
● Coefficient of determination (R²)

Classification:

● Correctly classified percentage (accuracy)
● True/false positives and negatives
● Precision and recall

Clustering:

● Level of agreement (Rand index...)
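
For reference, the two regression measures have standard formulations (these formulas are standard definitions, not taken from the original slides). For n data points with true values y_i, predictions ŷ_i and mean ȳ:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}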


Learning and evaluation

[Figure: three plots of Price vs. Surface, showing a model that is too simple, a good fit, and a model that is too complex]

● High bias: the model is too simple; high prediction error (underfitting).
● OK: prediction within acceptable margins.
● High variance: the model is too complex; high prediction error (overfitting).


Learning and evaluation

Typically, the performance measure is also used during the learning process, to guide it.

● At the end of the learning process, we obtain a performance measurement for the data used during the training (the training dataset).

WARNING: overfitting

● The model created might only be able to correctly predict the values it has already "seen" (the training dataset).

To detect this, we need (at least) two datasets with the same structure:

● Training set: used to train the model.
● Test set: used to evaluate the model.

The test set is only used at the end, and the resulting value of the performance measure is the final performance of our model. The test set must never be used for training.
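
In Spark, this split is usually done with the DataFrame method randomSplit. A minimal sketch (the DataFrame df and the seed value are placeholders):

// Randomly split df: roughly 70% for training, 30% for test.
// The seed makes the split reproducible across runs.
val Array(trainingSet, testSet) = df.randomSplit(Array(0.7, 0.3), seed = 42L)

The same idea appears in the Pipeline example at the end of this deck.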


Machine Learning with Spark
Data Science and Big Data are strongly related fields.

So far, we have learned how Spark can be used to load and process data, but
Spark can be used for many steps of a Data Science project:

● Data load
● Processing/cleaning
● Model training
● Model evaluation

The Spark component that provides Machine Learning tools is called MLlib.



Machine Learning in the Spark Stack

[Figure: MLlib's position within the Spark stack]


MLlib (Machine Learning Library)
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical
machine learning scalable and easy. At a high level, it provides tools such as:

● ML Algorithms: common learning algorithms such as classification,


regression, clustering, and collaborative filtering.
● Featurization: feature extraction, transformation, dimensionality
reduction, and selection.
● Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.
● Persistence: saving and loading algorithms, models, and Pipelines.
● Utilities: linear algebra, statistics, data handling, etc.



MLlib API
● All tools in MLlib can be accessed through the MLlib API
○ Available in Scala, Java and Python
○ Limited functionality in R, but growing
● The MLlib API is DataFrame-based
○ All algorithms, transformations and other operations take DataFrames as input and
(usually) produce DataFrames as output.
● An RDD-based MLlib API, from earlier versions of Spark, still exists
○ This API is currently in "maintenance mode": bugs are still being fixed, but no new functionality is added.
○ The main development focus is now the DataFrame-based API
○ The DataFrame-based API is the recommended one



Spark Applications using MLlib
Add the following dependencies (SBT):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.3.0",
  "org.apache.spark" %% "spark-sql" % "3.3.0",
  "org.apache.spark" %% "spark-mllib" % "3.3.0"
)



MLlib API components
The MLlib API is based on five main components:

● DataFrame: This ML API uses DataFrame from Spark SQL.


● Transformer: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame.
● Estimator: An Estimator is an algorithm which can be fit on a DataFrame
to produce a Transformer.
● Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
● Parameter: All Transformers and Estimators now share a common API
for specifying parameters.
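
Parameters can be set through setter methods (as in the examples that follow) or passed at fit time through a ParamMap. A small sketch, assuming an Estimator lr (such as the LinearRegression defined a few slides ahead) and a training DataFrame:

import org.apache.spark.ml.param.ParamMap

// Override lr's parameters for this fit() call only,
// without touching the values set through the setters.
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.01)
val model = lr.fit(training, paramMap)
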
Transformer
● A Transformer is an abstraction that includes feature transformers and
learned models.
● A Transformer implements a method transform(), which converts one
DataFrame into another, generally by appending one or more columns.
● For example:
○ A feature transformer might take a DataFrame, read a column (e.g., text), map it into a
new column (e.g., feature vectors), and output a new DataFrame with the mapped column
appended.
○ A learning model might take a DataFrame, read the column containing feature vectors,
predict the label for each feature vector, and output a new DataFrame with predicted
labels appended as a column.



Examples of Transformers

The first example creates a vector of features from each row of the DataFrame. This vector can later be used for training a regression model.

import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val df = Seq(
  (-3.0965012, 5.2371198, -0.7370271),
  (-0.2100299, -0.7810844, -1.3284768),
  (8.3525083, 5.3337562, 21.8897181),
  (-3.0380369, 6.5357180, 0.3469820),
  (5.9354651, 6.0223208, 17.9566144),
  (-6.8357707, 5.6629804, -8.1598308),
  (8.8919844, -2.5149762, 15.3622538),
  (6.3404984, 4.1778706, 16.7931822)
).toDF("x1", "x2", "y")

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val output = assembler.transform(df)
output.show(truncate = false)

The second example normalizes the vector of features.

import org.apache.spark.ml.feature.Normalizer

val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(output)
l1NormData.show(truncate = false)


Estimator
● An Estimator abstracts the concept of a learning algorithm or any
algorithm that fits or trains on data.
● An Estimator implements a method fit(), which accepts a DataFrame
and produces a Model, which is a Transformer.
● For example:
○ A learning algorithm such as LogisticRegression is an Estimator, and calling fit()
trains a LogisticRegressionModel , which is a Model and hence a Transformer.



Estimator example 1: LinearRegression

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("y")
  .setMaxIter(10)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(output)

println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")

val trainingSummary = lrModel.summary

println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
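
Since lrModel is a Transformer, it can also be applied to new data to obtain predictions. A short sketch reusing the output DataFrame from the VectorAssembler example (in a real project this would be a separate test set):

// transform() appends a "prediction" column with the model's estimates
lrModel.transform(output)
  .select("features", "y", "prediction")
  .show(truncate = false)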


Estimator example 2: K-Means

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val df2 = Seq(
  (1.60193653, -1.8679101),
  (-1.1328963, -1.9607465),
  (2.40675869, -1.5994823),
  (0.09330145, -2.9446696),
  (1.3795901, -2.5489864),
  (-0.42065496, -2.8165693),
  (0.55753398, -2.0145494),
  (1.3066549, -2.3208153),
  (0.66224722, -0.9406476),
  (1.19072851, -2.4178092),
  (4.67961769, 1.6375689),
  (5.03015133, 1.4575724),
  (6.1003413, 2.1673923),
  (4.20259176, 1.8237144),
  (4.93339445, 1.8983999),
  (6.70975052, 1.5899655),
  (5.01461979, 1.6051478),
  (5.00005277, 1.6855351),
  (4.12926186, 2.0582398)
).toDF("x1", "x2")

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val dataset = assembler.transform(df2)

val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)
val model = kmeans.fit(dataset)

println("Cluster Centers: ")
model.clusterCenters.foreach(println)
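
The fitted KMeansModel is a Transformer as well: transform() appends each point's cluster assignment. A hedged sketch of using it together with MLlib's ClusteringEvaluator (available since Spark 2.3) to compute the silhouette score:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Assign each point to a cluster (appends a "prediction" column)
val predictions = model.transform(dataset)

// Silhouette: values close to 1 indicate well-separated clusters
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette = $silhouette")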


Pipeline
● In machine learning, it is common to run a sequence of algorithms to
process and learn from data.
● Example: A simple text document processing workflow might include
several stages:
○ Split each document’s text into words.
○ Convert each document’s words into a numerical feature vector.
○ Learn a prediction model using the feature vectors and labels.
● MLlib represents such a workflow as a Pipeline, which consists of a
sequence of PipelineStages (Transformers and Estimators) to be run in a
specific order.



Pipeline
● A Pipeline is specified as a sequence of stages (Transformer or Estimator).
● Stages are run in order.
● The input DataFrame is transformed as it passes through each stage.
○ For Transformer stages, the transform() method is called on the DataFrame.
○ For Estimator stages, the fit() method is called to produce a Transformer, and that
Transformer’s transform() method is called on the DataFrame.



Pipeline
● A Pipeline is an Estimator.
● After a Pipeline’s fit() method runs, it produces a PipelineModel, which
is a Transformer.
● This PipelineModel is used at test time.



Pipeline Example

import org.apache.spark.ml.Pipeline

// Split the data into training (70%) and test (30%)
val split = df2.randomSplit(Array(0.7, 0.3))
val training = split(0)
val test = split(1)

// Create the pipeline stages
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val kmeans = new KMeans()
  .setFeaturesCol("features")
  .setK(2)

// Create the pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, kmeans))

// Train and use the pipeline
val model = pipeline.fit(training)
model.transform(test).show(truncate = false)
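
The Persistence tools mentioned in the MLlib overview apply here too: a fitted PipelineModel can be saved to disk and reloaded later. A minimal sketch (the path is just an example):

import org.apache.spark.ml.PipelineModel

// Save the fitted pipeline (all stages included) to disk
model.write.overwrite().save("/tmp/kmeans-pipeline-model")

// Reload it later and use it exactly like the original model
val sameModel = PipelineModel.load("/tmp/kmeans-pipeline-model")
sameModel.transform(test).show(truncate = false)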


MLlib documentation
● The list of Transformers and Estimators available in MLlib grows with each
Spark release.
● Always keep the Spark Programming Guides at hand: they are extremely
valuable when learning new models and techniques.
● http://spark.apache.org/docs/latest/ml-guide.html
