Xiangrui Meng
What is MLlib?
• 33 contributors
What is MLlib?
Algorithms:
• classification: logistic regression, linear support vector machine (SVM), naive Bayes
• clustering: k-means
Why MLlib?
scikit-learn?
Algorithms:
• classification: SVM, nearest neighbors, random forest, …
Mahout?
Algorithms:
• classification: logistic regression, naive Bayes, random forest, …
Mahout?
LIBLINEAR?
MATLAB?
R?
scikit-learn?
Weka?
Gradient descent
w ← w − α · ∑_{i=1}^{n} g(w; x_i, y_i)
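As a sketch of how this update maps onto a Spark RDD (hypothetical step and grad functions; not MLlib's actual optimizer, which lives in mllib.optimization):

import org.apache.spark.rdd.RDD

// One pass of batch gradient descent over an RDD of (features, label) pairs.
// `grad` is a hypothetical per-example gradient g(w; x_i, y_i) for the chosen loss.
def step(data: RDD[(Array[Double], Double)],
         w: Array[Double],
         alpha: Double)
        (grad: (Array[Double], (Array[Double], Double)) => Array[Double]): Array[Double] = {
  // sum the per-example gradients across the cluster
  val g = data.map(p => grad(w, p))
              .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  // w <- w - alpha * sum_i g(w; x_i, y_i)
  w.zip(g).map { case (wi, gi) => wi - alpha * gi }
}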
k-means (scala)
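The Scala listing did not survive extraction; the sketch below follows the MLlib programming guide's k-means example (file name and parameters illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

// Build the model (cluster the data into 2 classes, 10 iterations)
val clusters = KMeans.train(parsedData, 2, 10)

// Evaluate clustering by computing the within-set sum of squared errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)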
k-means (python)
from math import sqrt
from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line:
    array([float(x) for x in line.split(' ')])).cache()

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=1, initializationMode="k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = (parsedData.map(lambda point: error(point))
                  .reduce(lambda x, y: x + y))
print("Sum of squared error = " + str(cost))
Dimension reduction + k-means
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// compute principal components
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)

// project points to a low-dimensional space
val projected = mat.multiply(pc).rows

// train a k-means model on the projected data (k = 10, 20 iterations)
val model = KMeans.train(projected, 10, 20)
Collaborative filtering
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
// (rank = 1, 20 iterations, lambda = 0.01)
val model = ALS.train(ratings, 1, 20, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
Why MLlib?
• It ships with Spark as a standard component.
Out for dinner?
• Search for a restaurant and make a reservation.
• Start navigation.
Why smartphone?
Why MLlib?
Spark SQL + MLlib
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
  SELECT e.action,
         u.age,
         u.latitude,
         u.longitude
  FROM Users u
  JOIN Events e
  ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
  // assuming all four columns are numeric
  val features = Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3))
  LabeledPoint(row.getDouble(0), features)
}

// 100 iterations of SGD (illustrative)
val model = SVMWithSGD.train(training, 100)
Streaming + MLlib
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.streaming.twitter.TwitterUtils

// collect tweets using streaming

// train a k-means model
val model: KMeansModel = ...

// apply model to filter tweets
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets =
  statuses.filter(t => model.predict(featurize(t)) == clusterNumber)

// print tweets within this particular cluster
filteredTweets.print()
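featurize and clusterNumber are defined elsewhere in the demo; one plausible stand-in for featurize hashes a tweet's words into a fixed-size term-frequency vector. Treat the helper below as an assumption, not the demo's actual code:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Hypothetical featurizer: map a tweet's words to a 1000-dimensional
// term-frequency vector, matching whatever the k-means model was trained on.
val tf = new HashingTF(numFeatures = 1000)
def featurize(text: String): Vector = tf.transform(text.split(" ").toSeq)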
GraphX + MLlib
import org.apache.spark.graphx.Graph
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// assemble link graph
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices

// load page labels (spam or not) and content features
val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
val training: RDD[LabeledPoint] =
  labelAndFeatures.join(pageRank).map {
    case (id, ((label, features), pr)) =>
      // append PageRank as one extra feature at index 1000
      // (assuming 1000 content features)
      LabeledPoint(label, Vectors.sparse(1001, features ++ Seq((1000, pr))))
  }

// train a spam detector using logistic regression (100 iterations, illustrative)
val model = LogisticRegressionWithSGD.train(training, 100)
Why MLlib?
• Spark is a general-purpose big data platform.
• Reads from HDFS, S3, HBase, and any Hadoop data source.
Why MLlib?
• Scalability
• Performance
• User-friendly APIs
Logistic regression
Logistic regression - weak scaling
[Figure (Fig. 5): walltime for weak scaling of logistic regression - relative walltime vs. # machines (up to 32) for MLlib (MLbase), VW, and ideal scaling, with problem size growing from n=6K to n=200K at d=160K, plus a walltime (s) comparison of MLbase/MLlib, VW, and MATLAB.]

Logistic regression - strong scaling

[Figure (Fig. 7): walltime for strong scaling of logistic regression - speedup vs. # machines (up to 32) for MLlib (MLbase), VW, and ideal scaling, plus a walltime (s) comparison of MLbase/MLlib, VW, and MATLAB on 1-32 machines.]
Collaborative filtering
ALS - wall-clock time
System     Wall-clock time (seconds)
MATLAB     15443
Mahout     4206
GraphLab   291
MLlib      481
Implementation of k-means
Initialization:
• random
• k-means++
• k-means||
Iterations:
• broadcast everything
• data parallel
• fully parallel
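In MLlib the initialization strategy is chosen through the builder-style API; only random and k-means|| are exposed as modes (the constants below are the real ones; k, the iteration cap, and parsedData from the earlier example are illustrative):

import org.apache.spark.mllib.clustering.KMeans

// Pick the initialization strategy explicitly via the builder-style API.
val model = new KMeans()
  .setK(10)                                        // illustrative k
  .setMaxIterations(20)                            // illustrative iteration cap
  .setInitializationMode(KMeans.K_MEANS_PARALLEL)  // or KMeans.RANDOM
  .run(parsedData)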
Alternating least squares (ALS)
Broadcast everything
• Master loads the (small) data file and initializes the models.
• Master broadcasts the data and the initial models.
• At each iteration, updated models are broadcast again.
• Lots of communication overhead - doesn't scale well.
[Diagram: the master holds the Ratings, Movie Factors, and User Factors and broadcasts all of them to the workers.]
Data parallel
• Workers load the data.
• Master broadcasts the initial models.
• At each iteration, updated models are broadcast again.
• Much better scaling.
• Works on large datasets.
[Diagram: the Ratings are partitioned across the workers; the Movie Factors and User Factors are broadcast from the master.]
Fully parallel
• Models are instantiated at workers.
• At each iteration, models are shared via join between workers.
• Much better scalability.
• Works on large datasets.
[Diagram: each worker holds a partition of the Ratings together with its Movie Factors and User Factors; the master holds nothing.]
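A toy illustration of the share-via-join idea (not MLlib's actual ALS code; the names and types are hypothetical): ratings keyed by user id meet the current user factors through a join, so the exchange happens worker-to-worker and the master never materializes the model:

import org.apache.spark.rdd.RDD

// Joining co-partitions model and data on the workers.
def shareByJoin(
    ratings: RDD[(Int, (Int, Double))],      // userId -> (movieId, rating)
    userFactors: RDD[(Int, Array[Double])]   // userId -> factor vector
  ): RDD[(Int, ((Int, Double), Array[Double]))] =
  ratings.join(userFactors)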
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
• block-wise parallel
  • Users/products are partitioned into blocks, and the join is based on blocks instead of on individual users/products.
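A sketch of the block-wise grouping (illustrative only; ratings is the RDD[Rating] from the collaborative-filtering example, and hash partitioning by user id is an assumption):

// Assign each user to one of `numBlocks` blocks; factors are then exchanged
// once per (user block, product block) pair per iteration rather than once
// per individual rating.
val numBlocks = 8
val ratingsByUserBlock = ratings
  .map(r => (r.user % numBlocks, r))
  .groupByKey()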
New features for v1.x
• Sparse data
• L-BFGS
• Model evaluation
• Discretization
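For the sparse data support, a vector can be stored densely or as (index, value) pairs; both constructors below are real MLlib entry points:

import org.apache.spark.mllib.linalg.Vectors

// The same 4-dimensional vector, dense and sparse:
// the sparse form stores (size, indices, values) = (4, [0, 2], [1.0, 3.0]).
val dense  = Vectors.dense(1.0, 0.0, 3.0, 0.0)
val sparse = Vectors.sparse(4, Seq((0, 1.0), (2, 3.0)))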
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan
Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau,
Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark
Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath,
Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy
Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui
Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Ryan LeCompte
Interested?
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Github: https://github.com/apache/spark