

I/Apache Spark presentation


Apache Spark is a fast data processing engine dedicated to big data. It allows processing of
large volumes of data in a distributed manner (cluster computing).

Advantages: Speed, Ease of use, Versatility.

Supports in-memory processing, which increases the performance of big data analytical applications.

Can also be used for conventional on-disk processing when the data sets are too large for system
memory.

Used to process data from the Hadoop Distributed File System (HDFS), NoSQL databases, or relational
data stores such as Apache Hive.
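
As an illustration, here is a minimal PySpark sketch of reading data from HDFS and from Hive; the path, database and table names are hypothetical, and enableHiveSupport() assumes Hive is configured on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("spark-overview-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# Read a CSV file stored on HDFS (hypothetical path) and cache it in memory
events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
events.cache()
print(events.count())

# Query a Hive table (hypothetical database and table)
spark.sql("SELECT * FROM sales_db.sales").show(5)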

History
2009: creation within the AMPLab laboratory at the University of California, Berkeley, by Matei Zaharia,

2010: released as open source under a BSD license,

2013: donated to the Apache Software Foundation,

2014: promoted to Top-Level Project by the Apache Foundation.

Spark vs Hadoop
Hadoop: the solution of choice for processing large data sets with "one-pass" computations
(MapReduce),
Spark: more practical for use cases requiring multi-pass computations (machine learning),
 using them together is often the best approach,
 Spark can run on Hadoop 2 clusters through the YARN resource manager.
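
A minimal sketch of pointing a PySpark application at a YARN cluster; it assumes Hadoop and YARN are already installed and HADOOP_CONF_DIR is set, and the application name is arbitrary:

from pyspark.sql import SparkSession

# "yarn" as master delegates resource allocation to the Hadoop cluster
spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-on-yarn-demo") \
    .getOrCreate()

print(spark.sparkContext.master)  # expected to print "yarn"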

Architecture

A Spark application consists of a driver program, which creates the SparkContext and coordinates
the work, and of executors spread over the worker nodes; a cluster manager (standalone, YARN, or
Mesos) allocates resources between applications.


Spark Data Models: Resilient Distributed Datasets (RDDs)


An RDD is an immutable collection of elements computed from a data source and partitioned across
the nodes of the cluster; transformations on an RDD are evaluated lazily, only when an action is
called.
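
A small sketch of this behaviour in the pyspark shell (where the SparkContext sc already exists): the map transformation builds a new RDD, but nothing is computed until the collect action is called.

nums = sc.parallelize([1, 2, 3, 4])   # RDD built from a Python list
doubled = nums.map(lambda x: x * 2)   # transformation: nothing runs yet
print(doubled.collect())              # action: triggers the computation -> [2, 4, 6, 8]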

Catalyst and Tungsten


 Catalyst is the name of Spark's workflow optimizer. Originally created for
Spark SQL, Catalyst is also used with Datasets and DataFrames. Its role is to rewrite the
execution plan of a query (or an execution workflow) in order to obtain maximum
performance.
 The Tungsten project aims to improve the performance of Spark, in particular by
optimizing the data structures and memory usage.
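
To see Catalyst at work, the execution plan of a DataFrame query can be displayed with explain(); a small sketch, assuming a SparkSession named spark is available:

# Hypothetical in-memory DataFrame, just to have something to optimize
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])

# Catalyst rewrites this filter + projection into an optimized physical plan
df.filter(df["id"] > 1).select("letter").explain(True)  # prints the analyzed, optimized and physical plans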


Spark Data Models: Performance

When to use RDDs?

 Unstructured data,
 Need for low-level control over the data and its processing.

When to use DataFrames?

 Structured or semi-structured data,
 Need for higher-level abstractions over the data,
 Need for high-level transformations and actions.
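
As a rough illustration of the difference, the same filter expressed on an RDD and on a DataFrame (a sketch, assuming the pyspark shell with sc and spark available, and hypothetical data):

rows = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD version: the structure of each tuple is opaque to Spark
people_rdd = sc.parallelize(rows)
print(people_rdd.filter(lambda r: r[1] > 30).collect())

# DataFrame version: named columns let Catalyst optimize the query
people_df = spark.createDataFrame(rows, ["name", "age"])
people_df.filter(people_df["age"] > 30).show()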

II/Concepts of statistics to know (illustrated in the sketch below)

- Statistical moments
- Mean and median
- Standard deviation
- Kurtosis
- Skewness
- Correlation
- Covariance
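
All of these quantities can be computed directly on a Spark DataFrame; a minimal sketch, assuming a SparkSession named spark and a small hypothetical two-column dataset:

from pyspark.sql import functions as F

stats_df = spark.createDataFrame(
    [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 5.0), (100.0, 90.0)],
    ["x", "y"])

stats_df.select(
    F.mean("x").alias("mean"),
    F.expr("percentile_approx(x, 0.5)").alias("median"),  # approximate median via Spark SQL
    F.stddev("x").alias("std_dev"),
    F.skewness("x").alias("skewness"),
    F.kurtosis("x").alias("kurtosis"),
    F.corr("x", "y").alias("correlation"),
    F.covar_samp("x", "y").alias("covariance")).show()
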
III/Machine learning concepts to know
- K-means
- Decision tree
- Random Forest
- Neural Network


IV/Practical workshop
1)RDD without Catalyst and Tungsten
1)Run pyspark in a terminal.
2)Create an rdd with the following values: 1,2,3,4
rdd = sc.parallelize([1,2,3,4])
3)Multiply the rdd elements by 2 and put the result in rdd1
rdd1 = rdd.map(lambda x: x*2)
4)Filter the even elements of rdd and put the result in rdd2
rdd2 = rdd.filter(lambda x: x%2 == 0)
5)Create an rdd3 with the following values: 1,4,2,2,3
rdd3 = sc.parallelize([1,4,2,2,3])
6)Select the distinct elements of rdd3 and put the result in rdd4
rdd4 = rdd3.distinct()
7)Create an rdd5 with the following values: 1,2,3
rdd5 = sc.parallelize([1,2,3])
8)Create an rdd6 which will contain couples (pairs) built from the values of rdd5, each value with that value + 5
rdd6 = rdd5.map(lambda x: [x, x+5])
9)Create an rdd7 which will contain the values of rdd5 together with each value plus 5, in flat mode
rdd7 = rdd5.flatMap(lambda x: [x, x+5])
10)Compute the sum of the rdd5 elements
rdd5.reduce(lambda a,b: a+b)
11)Collect the rdd5 values
rdd5.collect()
12)Take the first two elements of rdd5
rdd5.take(2)
13)Create an rdd8 with the following values: 5,3,1,2
rdd8 = sc.parallelize([5,3,1,2])
14)Select the three largest values of rdd8
rdd8.takeOrdered(3, key=lambda s: -s)
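
Outside the pyspark shell, the SparkContext sc is not created automatically; a minimal sketch of the same kind of operations in a standalone script (the file name rdd_demo.py is hypothetical, and the script would be launched with spark-submit rdd_demo.py):

# rdd_demo.py (hypothetical file name)
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")  # the pyspark shell normally creates sc for you

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())          # [2, 4, 6, 8]
print(rdd.filter(lambda x: x % 2 == 0).collect())  # [2, 4]
print(rdd.reduce(lambda a, b: a + b))              # 10

sc.stop()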

1.1)K-means

1) Import the KMeans function from mllib with: from pyspark.mllib.clustering import KMeans.
2) Import the array function with: from numpy import array.
3) Create a file with the following data and save it in your home directory:


0.0 0.0 0.0
0.1 0.1 0.0
0.1 0.0 0.1
9.0 9.2 9.0
9.3 9.0 9.2
9.0 9.2 9.1
4)Load the data with: data = sc.textFile("name_of_your_file").
5)#this step is optional# Collect the data of the RDD with: data.collect().
6)Prepare the data transformation by separating the elements of each row and converting them
to float with: parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).
Note: in Python, lambda defines an anonymous function (a function with no name).
7)Execute and display the transformed data with: parsedData.collect()
8)Start the k-means algorithm with: clusters = KMeans.train(parsedData, 2, maxIterations=10,
initializationMode="random").
9)Execute (with display) the prediction with: clusters.predict(parsedData).collect().
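
To inspect the result further, the trained model also exposes the cluster centers and a cost measure; a short sketch continuing from the previous steps:

# Coordinates of the two cluster centers found during training
print(clusters.clusterCenters)

# Within Set Sum of Squared Errors: lower means tighter clusters
print("WSSSE = %s" % clusters.computeCost(parsedData))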

1.2)Decision tree

1)Place the irisnum.csv file from Downloads in your home directory.


2)Load the irisnum.csv data with: data = sc.textFile("irisnum.csv")
3)Import array from numpy with: from numpy import array
4)Prepare the transformation of the rows by separating the values and converting them to float
with:
pdata = data.map(lambda line: array([float(x) for x in line.split(',')]))
5)Display pdata with: pdata.collect()
6)Import the LabeledPoint function with:
from pyspark.mllib.regression import LabeledPoint.
7)Create a function called parse that labels a data line received as input (the last column holds
the class value) with:
def parse(l):
    return LabeledPoint(l[4], l[0:4])
8)Pass the lines one by one in order to label all the data with:
fdata = pdata.map(lambda l: parse(l))
9)Randomly divide the data in order to have a training base and a test base with:
(trainingData,testData) = fdata.randomSplit([0.8,0.2])
10)Import the function of decision trees with:
from pyspark.mllib.tree import DecisionTree
11)Prepare the model with:

model = DecisionTree.trainClassifier(trainingData, numClasses=3, categoricalFeaturesInfo={})
12)Perform the prediction for the test base with:
predictions = model.predict(testData.map(lambda r: r.features))
13)Build pairs of (prediction, actual value) for each test example with:
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))
14)Import MulticlassMetrics function for model evaluation with:
from pyspark.mllib.evaluation import MulticlassMetrics.
15)Start the evaluation function with:
metrics = MulticlassMetrics(predictionAndLabels)
16)Calculate the model precision with: precision = metrics.precision()
17)Calculate the model recall with: recall = metrics.recall()
18)Calculate the F1 score with: f1Score = metrics.fMeasure()
19)Display a title for the results with: print("Summary Stats")
20)Display the model precision with: print("Precision = %s" % precision)
21)Display the model recall with: print("Recall = %s" % recall)
22)Display the model F1 score with: print("F1 Score = %s" % f1Score)
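
To go further with the evaluation, MulticlassMetrics also exposes the confusion matrix and per-class metrics; a short sketch continuing from the previous steps (the label 0.0 is just one of the three classes):

# Rows are actual classes, columns are predicted classes
print(metrics.confusionMatrix().toArray())

# Per-class precision and recall, here for the class labelled 0.0
print("Precision(0.0) = %s" % metrics.precision(0.0))
print("Recall(0.0) = %s" % metrics.recall(0.0))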

2)DataFrames with Catalyst and Tungsten


1) Place the Iris1.csv file in the home directory of the VM.
2) Load this file with:
df = spark.read.load("Iris1.csv", format="csv", sep=",", inferSchema="true", header="true")
3) Here is another possibility to load the file with:
df1 = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('Iris1.csv').
Note: inferSchema automatically infers the column types.
4) Count the number of lines of the data frame with: df.count()
5) Display the first 10 lines of the data frame with: df.show(10)
6) Filter and display the lines whose petal_lengths are strictly greater than 6 with:
df.filter(df["petal_length"]>6).show()
7) Count the "species" by group with: df.groupBy(df["species"]).count().show()
8) Display the first 10 lines with: df.head(10).
9) Select the different "species" using a sql query with:
df.registerTempTable("table")
distinct_classes = sqlContext.sql("select distinct species from table")
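
To display the result of the SQL query and check the schema inferred from the CSV file, the following calls can be added (a short follow-up sketch):

distinct_classes.show()   # the distinct values of the "species" column
df.printSchema()          # column names and the types inferred by inferSchema
df.describe().show()      # basic statistics: count, mean, stddev, min, max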


Decision Tree
1) Transform the data frame df by indexing the class variable "species" and creating a vector of
"features" with:
from pyspark.ml.feature import StringIndexer
speciesIndexer = StringIndexer(inputCol="species", outputCol="speciesIndex")
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["petal_width", "petal_length", "sepal_width",
"sepal_length"], outputCol="features")
data = vectorAssembler.transform(df)
index_model = speciesIndexer.fit(data)
data_indexed = index_model.transform(data)
2) Randomly divide the data into the learning base and the test base with:
trainingData, testData = data_indexed.randomSplit([0.8, 0.2], seed=0).
3) Import the decision trees function with: from pyspark.ml.classification import
DecisionTreeClassifier
4) Configure the model with:
dt = DecisionTreeClassifier().setLabelCol("speciesIndex").setFeaturesCol("features")
5) Start training with: model = dt.fit(trainingData).
6) Perform the classification of the test base with: classifications = model.transform(testData)
7) Perform the evaluation, as in the earlier workshop, with:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="speciesIndex", predictionCol="prediction",
metricName="accuracy")
accuracy = evaluator.evaluate(classifications)
print("Test set accuracy = " + str(accuracy))
