
SPARK-16424), but as work is still actively progressing on streaming Datasets, it is difficult to know when streaming ML will be available.


It is important to keep the future in mind when deciding between MLlib and ML.
New features will continue to be developed for Spark ML, and they will not be backported to Spark MLlib, which is in a bug-fix-only stage.
Spark ML's integrated pipeline API makes it easier to implement meta-algorithms, like parameter search over different components. Both APIs support regression, classification, and clustering algorithms. If you're on the fence for your project, Spark ML is a reasonable default, as it is the primary actively developed machine learning library for Spark going forward.

Working with MLlib


Many of the same performance considerations that apply to working with Spark Core also apply directly to working with MLlib. One of the most important is RDD reuse: many machine learning algorithms depend on iterative computation or optimization, so ensuring your inputs are persisted at the right level can make a huge difference.
Supervised algorithms in the Spark MLlib API are trained on RDDs of labeled points, while unsupervised algorithms use RDDs of vectors. These labeled point and vector classes are unique to the MLlib library, and are separate from both Scala's Vector class and Spark ML's equivalent classes.
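
To make that concrete, here is a minimal sketch (not from the book's examples) of building and persisting an RDD of labeled points before handing it to an iterative algorithm. The helper name, input format, and storage level are illustrative assumptions.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector => SparkVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: build and persist training data for an iterative algorithm.
def prepareTrainingData(sc: SparkContext, raw: Seq[(Double, Array[Double])]) = {
  val labeledPoints = sc.parallelize(raw).map { case (label, values) =>
    // MLlib's own vector type, renamed here to avoid clashing with Scala's Vector
    val features: SparkVector = Vectors.dense(values)
    LabeledPoint(label, features)
  }
  // Iterative optimizers scan the input many times; persist it once up front.
  labeledPoints.persist(StorageLevel.MEMORY_AND_DISK)
}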

Getting Started with MLlib (Organization and Imports)


You can include MLlib in the same way as other Spark components. Its inclusion can be simplified using the steps discussed in "Managing Spark Dependencies" on page 31. The Maven coordinates for Spark 2.1's MLlib are org.apache.spark:spark-mllib_2.11:2.1.0.
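
For example, with sbt the coordinates above translate to a single dependency line (a minimal sketch; the rest of the build definition and the "provided" scope for cluster deployments are assumptions):

// build.sbt fragment; with scalaVersion 2.11, %% resolves to spark-mllib_2.11
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided"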
The imports for MLlib are a little scattered compared to the other Spark components; see the imports used to train a simple classification model in Example 9-1.

Example 9-1. Sample MLlib imports for building a LogisticRegression model


// netlib-java BLAS instance for low-level linear algebra operations
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS,
  LogisticRegressionModel}
// Rename Vector to SparkVector to avoid conflicts with Scala's Vector class
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.mllib.regression.LabeledPoint
// Feature transformers such as HashingTF and StandardScaler
import org.apache.spark.mllib.feature._
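
To show how these imports fit together, the following is a minimal, hypothetical sketch of training and using a binary classifier on an already-prepared RDD of labeled points (for example, the persisted RDD from the earlier sketch); the helper names are made up for illustration.

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS,
  LogisticRegressionModel}
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: train a binary logistic regression model with LBFGS.
def trainClassifier(labeledPoints: RDD[LabeledPoint]): LogisticRegressionModel = {
  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(labeledPoints)
}

// Score a single example represented as an MLlib vector.
def score(model: LogisticRegressionModel, features: SparkVector): Double =
  model.predict(features)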

