Support of Big Data Machine Learning With Apache Spark
Supports in-memory processing, which increases the performance of big data analytical applications.
Can also be used for conventional on-disk processing when data sets are too large to fit in system memory.
Used to process data from the Hadoop Distributed File System (HDFS), NoSQL databases, or relational data stores such as Apache Hive, as sketched below.
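As an illustration of the last point, a pyspark shell session can read from these sources roughly as follows; the HDFS path, database and table names are placeholders for this sketch, not values from the handout:
rdd = sc.textFile("hdfs:///user/demo/events.txt")  # raw text stored on HDFS
hive_df = spark.sql("SELECT * FROM demo_db.events")  # a table managed by Apache Hive (requires Hive support in the session)
print(rdd.count(), hive_df.count())  # both sources are processed by the same Spark engine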
History
2009: created within the AMPLab at the University of California, Berkeley, by Matei Zaharia.
Spark vs Hadoop
Hadoop: the solution of choice for processing large data sets with "one-pass" computations (MapReduce).
Spark: more practical for use cases requiring multi-pass computations (machine learning).
Using them together is often the best option:
Spark can run on Hadoop 2 clusters through the YARN resource manager, as in the sketch below.
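A minimal sketch of that last point, assuming the client machine is already configured against a YARN cluster (the application name is an illustrative choice):
from pyspark.sql import SparkSession
# Let YARN, rather than the local machine, allocate the executors of this application.
spark = SparkSession.builder.master("yarn").appName("spark-on-yarn-demo").getOrCreate()
print(spark.sparkContext.master)  # prints "yarn" when the session is attached to the resource manager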
Architecture
Use the low-level RDD API for unstructured data or when fine-grained control over the data is needed.
IV/ Practical workshop
1) RDD without Catalyst and Tungsten
1) Run pyspark in a terminal.
2) Create an RDD with the values 1, 2, 3, 4:
rdd = sc.parallelize([1, 2, 3, 4])
3) Multiply the elements of rdd by 2 and put the result in rdd1:
rdd1 = rdd.map(lambda x: x * 2)
4) Filter the even elements of rdd and put the result in rdd2:
rdd2 = rdd.filter(lambda x: x % 2 == 0)
5) Create rdd3 with the values 1, 4, 2, 2, 3:
rdd3 = sc.parallelize([1, 4, 2, 2, 3])
6) Select the distinct elements of rdd3 and put the result in rdd4:
rdd4 = rdd3.distinct()
7) Create rdd5 with the values 1, 2, 3:
rdd5 = sc.parallelize([1, 2, 3])
8) Create rdd6 containing, for each value x of rdd5, the pair [x, x + 5]:
rdd6 = rdd5.map(lambda x: [x, x + 5])
9) Create rdd7 containing the values of rdd5 together with each value plus 5, in flat mode:
rdd7 = rdd5.flatMap(lambda x: [x, x + 5])
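The difference between map and flatMap can be seen by collecting both results:
rdd6.collect()  # [[1, 6], [2, 7], [3, 8]]
rdd7.collect()  # [1, 6, 2, 7, 3, 8]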
10) Compute the sum of the elements of rdd5:
rdd5.reduce(lambda a, b: a + b)
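reduce combines the elements two by two with the given function, so for rdd5 = [1, 2, 3] this call returns 6.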
11) Collect the values of rdd5:
rdd5.collect()
12) Take the first two elements of rdd5:
rdd5.take(2)
13) Create rdd8 with the values 5, 3, 1, 2:
rdd8 = sc.parallelize([5, 3, 1, 2])
14) Select the three largest values of rdd8:
rdd8.takeOrdered(3, lambda s: -1 * s)
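takeOrdered sorts by the given key; negating each value puts the largest elements first, so this call returns [5, 3, 2].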
1.1) K-means
1) Import the KMeans class from MLlib with: from pyspark.mllib.clustering import KMeans.
2) Import the array function with: from numpy import array.
3) Create a file with the following data and save it in the home directory:
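The sample data from the handout is not reproduced above; the sketch below therefore assumes a small space-separated file named kmeans_data.txt in the home directory and illustrative parameter values (k = 2, 10 iterations, random initialization):
from pyspark.mllib.clustering import KMeans
from numpy import array
# Each line of the assumed file holds one point, e.g. "0.0 0.0", "0.1 0.1", "9.0 9.0", "9.2 9.1".
data = sc.textFile("kmeans_data.txt")
parsed = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Train a model with two clusters.
clusters = KMeans.train(parsed, 2, maxIterations=10, initializationMode="random")
print(clusters.centers)  # coordinates of the two cluster centres
print(clusters.predict(array([0.2, 0.2])))  # cluster index assigned to a new point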
1.2) Decision tree
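The steps below operate on a DataFrame df holding the iris data set; its loading step is not reproduced above, so as a minimal sketch (the file name and read options are assumptions):
df = spark.read.csv("iris.csv", header=True, inferSchema=True)
df.printSchema()  # expected columns: sepal_length, sepal_width, petal_length, petal_width, species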
1) Transform the data frame df by indexing the class variable "species" and creating a vector of "features" with:
from pyspark.ml.feature import StringIndexer
speciesIndexer = StringIndexer(inputCol="species", outputCol="speciesIndex")
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["petal_width", "petal_length", "sepal_width", "sepal_length"], outputCol="features")
data = vectorAssembler.transform(df)
index_model = speciesIndexer.fit(data)
data_indexed = index_model.transform(data)
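To check the result of these transformations, the new columns can be inspected (this show call is only an illustrative check, not one of the handout's steps):
data_indexed.select("features", "speciesIndex").show(3)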
2) Randomly split the data into a training set and a test set with:
trainingData, testData = data_indexed.randomSplit([0.8, 0.2], seed=0)
3) Import the decision tree classifier with: from pyspark.ml.classification import DecisionTreeClassifier
4) Configure the model with: dt = DecisionTreeClassifier().setLabelCol("speciesIndex").setFeaturesCol("features")
5) Start training with: model = dt.fit(trainingData)
6) Classify the test set with: classifications = model.transform(testData)
7) Evaluate the accuracy of the classification with:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="speciesIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(classifications)
print("Test set accuracy = " + str(accuracy))