SPARK-16424), but as work is still actively progressing on streaming Datasets, it is
difficult to know when streaming ML will be available.
It is important to keep the future in mind when deciding between MLlib and ML. New features will continue to be developed for Spark ML, but they will not be backported to Spark MLlib, which is in a bug-fix-only stage. Spark ML's integrated pipeline API also makes it easier to implement meta-algorithms, such as parameter search over different components. Both APIs support regression, classification, and clustering algorithms. If you're on the fence for your project, Spark ML is a reasonable default, as it is the primary, actively developed machine learning library for Spark going forward.
Working with MLlib
Many of the same performance considerations that apply to working with Spark Core also directly apply to working with MLlib. One of the most important is RDD reuse: many machine learning algorithms depend on iterative computation or optimization, so ensuring that your inputs are persisted at the right level can make a huge difference. Supervised algorithms in the Spark MLlib API are trained on RDDs of labeled points, while unsupervised algorithms use RDDs of vectors. These labeled point and vector classes are unique to the MLlib library, and are separate from both Scala's Vector class and Spark ML's equivalent classes.
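As a brief illustration of these types, the following sketch builds an RDD of MLlib labeled points and persists it before iterative training. The feature values, application name, and storage level here are arbitrary choices for the example, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.storage.StorageLevel

object PersistTrainingData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-example").setMaster("local[2]"))
    // Each training record is a label plus an MLlib vector of features.
    val trainingData = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(5.0, 1.2)),
      LabeledPoint(0.0, Vectors.dense(1.0, 3.4))))
    // Persist before handing the RDD to an iterative algorithm, so each
    // optimization pass reuses the cached partitions instead of recomputing.
    trainingData.persist(StorageLevel.MEMORY_ONLY)
    // An action forces materialization into the cache.
    println(trainingData.count())
    sc.stop()
  }
}
```

Because the algorithm will pass over the data many times, persisting before the first pass avoids recomputing the input lineage on every iteration.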
Getting Started with MLlib (Organization and Imports)
You can include MLlib in the same way as the other Spark components, and its inclusion can be simplified using the steps discussed in "Managing Spark Dependencies" on page 31. The Maven coordinates for Spark 2.1's MLlib are org.apache.spark:spark-mllib_2.11:2.1.0. The imports for MLlib are a little scattered compared to those of the other Spark components; see the imports used to train a simple classification model in Example 9-1.
Example 9-1. Sample MLlib imports for building a LogisticRegression model
import com.github.fommil.netlib.BLAS.{getInstance => blas}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS,
  LogisticRegressionModel}
// Rename Vector to SparkVector to avoid conflicts with Scala's Vector class
import org.apache.spark.mllib.linalg.{Vector => SparkVector}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.feature._
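With those imports in place, a minimal training sketch might look like the following. The tiny in-memory dataset and local master are placeholders for illustration only; real training data would come from an input source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object TrainLogisticExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("train-logistic").setMaster("local[2]"))
    // Cache the labeled points since LBFGS makes multiple passes over them.
    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(2.0)),
      LabeledPoint(0.0, Vectors.dense(-2.0)))).cache()
    // Train a binary logistic regression model with the LBFGS optimizer.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(points)
    // Predict the class of a new feature vector.
    println(model.predict(Vectors.dense(3.0)))
    sc.stop()
  }
}
```

The run method returns a LogisticRegressionModel, which can then be used to score individual vectors or whole RDDs of vectors.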