

I/Apache Spark presentation


Apache Spark is a fast data processing engine dedicated to big data. It allows processing of
large volumes of data in a distributed manner (cluster computing).

Advantages: Speed, Ease of use, Versatility.

Supports in-memory processing, which increases the performance of big data analytical applications.

Can also be used for conventional on-disk processing when the data sets are too large for system
memory.

Used to process data from the Hadoop Distributed File System (HDFS), NoSQL databases, or relational
data stores such as Apache Hive.
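
As an illustration, here is a minimal PySpark sketch of reading data from HDFS and from Hive; the path, database and table names are hypothetical, and enableHiveSupport() assumes Hive is configured on the cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("spark-overview-demo") \
    .enableHiveSupport() \
    .getOrCreate()

# Read a CSV file stored on HDFS (hypothetical path) and cache it in memory
events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
events.cache()
print(events.count())

# Query a Hive table (hypothetical database and table)
spark.sql("SELECT * FROM sales_db.sales").show(5)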

History
2009: creation within the AMPLab laboratory at the University of California, Berkeley, by Matei Zaharia,

2010: released as open source under a BSD license,

2013: donated to the Apache Software Foundation,

2014: promoted to Top-Level Project by the Apache Foundation.

Spark vs Hadoop
Hadoop: the solution of choice for processing large data sets with "one-pass" computations
(MapReduce),
Spark: more practical for use cases requiring multi-pass computations (machine learning),
 using them together is often the best approach,
 Spark can run on Hadoop 2 clusters through the YARN resource manager.
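
A minimal sketch of pointing a PySpark application at a YARN cluster; it assumes Hadoop and YARN are already installed and HADOOP_CONF_DIR is set, and the application name is arbitrary:

from pyspark.sql import SparkSession

# "yarn" as master delegates resource allocation to the Hadoop cluster
spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-on-yarn-demo") \
    .getOrCreate()

print(spark.sparkContext.master)  # expected to print "yarn"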

Architecture

A Spark application consists of a driver program, which creates the SparkContext and coordinates
the work, and of executors spread over the worker nodes; a cluster manager (standalone, YARN, or
Mesos) allocates resources between applications.


Spark Data Models: Resilient Distributed Datasets (RDDs)


An RDD is an immutable collection of elements computed from a data source and partitioned across
the nodes of the cluster; transformations on an RDD are evaluated lazily, only when an action is
called.
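
A small sketch of this behaviour in the pyspark shell (where the SparkContext sc already exists): the map transformation builds a new RDD, but nothing is computed until the collect action is called.

nums = sc.parallelize([1, 2, 3, 4])   # RDD built from a Python list
doubled = nums.map(lambda x: x * 2)   # transformation: nothing runs yet
print(doubled.collect())              # action: triggers the computation -> [2, 4, 6, 8]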

Catalyst and Tungsten


 Catalyst is the name of Spark's workflow optimizer. Originally created for
Spark SQL, Catalyst is also used with Datasets and DataFrames. Its role is to rewrite the
execution plan of a query (or an execution workflow) in order to obtain maximum
performance.
 The Tungsten project aims to improve the performance of Spark, in particular by
optimizing the data structures and memory usage.
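
To see Catalyst at work, the execution plan of a DataFrame query can be displayed with explain(); a small sketch, assuming a SparkSession named spark is available:

# Hypothetical in-memory DataFrame, just to have something to optimize
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])

# Catalyst rewrites this filter + projection into an optimized physical plan
df.filter(df["id"] > 1).select("letter").explain(True)  # prints the analyzed, optimized and physical plans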


Spark Data Models: Performance

When to use RDDs?

 Unstructured data,
 Need for low-level control over the data and its processing.

When to use DataFrames?

 Structured or semi-structured data,
 Need for higher-level abstractions over the data,
 Need for high-level transformations and actions.
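
As a rough illustration of the difference, the same filter expressed on an RDD and on a DataFrame (a sketch, assuming the pyspark shell with sc and spark available, and hypothetical data):

rows = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD version: the structure of each tuple is opaque to Spark
people_rdd = sc.parallelize(rows)
print(people_rdd.filter(lambda r: r[1] > 30).collect())

# DataFrame version: named columns let Catalyst optimize the query
people_df = spark.createDataFrame(rows, ["name", "age"])
people_df.filter(people_df["age"] > 30).show()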

II/Concepts of statistics to know (illustrated in the sketch below)

- Statistical moments
- Mean and median
- Standard deviation
- Kurtosis
- Skewness
- Correlation
- Covariance
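
All of these quantities can be computed directly on a Spark DataFrame; a minimal sketch, assuming a SparkSession named spark and a small hypothetical two-column dataset:

from pyspark.sql import functions as F

stats_df = spark.createDataFrame(
    [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 5.0), (100.0, 90.0)],
    ["x", "y"])

stats_df.select(
    F.mean("x").alias("mean"),
    F.expr("percentile_approx(x, 0.5)").alias("median"),  # approximate median via Spark SQL
    F.stddev("x").alias("std_dev"),
    F.skewness("x").alias("skewness"),
    F.kurtosis("x").alias("kurtosis"),
    F.corr("x", "y").alias("correlation"),
    F.covar_samp("x", "y").alias("covariance")).show()
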
III/Machine learning concepts to know
- K-means
- Decision tree
- Random Forest
- Neural Network


IV/Practical workshop
1)RDD without Catalyst and Tungsten
1)Run pyspark in a terminal.
2)Create an rdd with the following values: 1,2,3,4
rdd = sc.parallelize([1,2,3,4])
3)Multiply the rdd elements by 2 and put the result in rdd1
rdd1 = rdd.map(lambda x: x*2)
4)Filter the even elements of rdd and put the result in rdd2
rdd2 = rdd.filter(lambda x: x%2 == 0)
5)Create an rdd3 with the following values: 1,4,2,2,3
rdd3 = sc.parallelize([1,4,2,2,3])
6)Select the distinct elements of rdd3 and put the result in rdd4
rdd4 = rdd3.distinct()
7)Create an rdd5 with the following values: 1,2,3
rdd5 = sc.parallelize([1,2,3])
8)Create an rdd6 which will contain couples (pairs) built from the values of rdd5, each value with that value + 5
rdd6 = rdd5.map(lambda x: [x, x+5])
9)Create an rdd7 which will contain the values of rdd5 together with each value plus 5, in flat mode
rdd7 = rdd5.flatMap(lambda x: [x, x+5])
10)Compute the sum of the rdd5 elements
rdd5.reduce(lambda a,b: a+b)
11)Collect the rdd5 values
rdd5.collect()
12)Take the first two elements of rdd5
rdd5.take(2)
13)Create an rdd8 with the following values: 5,3,1,2
rdd8 = sc.parallelize([5,3,1,2])
14)Select the three largest values of rdd8
rdd8.takeOrdered(3, key=lambda s: -s)
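
Outside the pyspark shell, the SparkContext sc is not created automatically; a minimal sketch of the same kind of operations in a standalone script (the file name rdd_demo.py is hypothetical, and the script would be launched with spark-submit rdd_demo.py):

# rdd_demo.py (hypothetical file name)
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")  # the pyspark shell normally creates sc for you

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())          # [2, 4, 6, 8]
print(rdd.filter(lambda x: x % 2 == 0).collect())  # [2, 4]
print(rdd.reduce(lambda a, b: a + b))              # 10

sc.stop()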

1.1)K-means

1) Import the KMeans function from mllib with: from pyspark.mllib.clustering import KMeans.
2) Import the array function with: from numpy import array.
3) Create a file with the following data and save it in your home directory:


0.0 0.0 0.0
0.1 0.1 0.0
0.1 0.0 0.1
9.0 9.2 9.0
9.3 9.0 9.2
9.0 9.2 9.1
4)Load the data with: data = sc.textFile("name_of_your_file").
5)#this step is optional# Collect the data of the RDD with: data.collect().
6)Prepare the data transformation by separating the elements of each row and converting them
to float with: parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).
Note: in Python, lambda defines an anonymous function (a function with no name).
7)Execute and display the transformed data with: parsedData.collect()
8)Start the k-means algorithm with: clusters = KMeans.train(parsedData, 2, maxIterations=10,
initializationMode="random").
9)Execute (with display) the prediction with: clusters.predict(parsedData).collect().
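
To inspect the result further, the trained model also exposes the cluster centers and a cost measure; a short sketch continuing from the previous steps:

# Coordinates of the two cluster centers found during training
print(clusters.clusterCenters)

# Within Set Sum of Squared Errors: lower means tighter clusters
print("WSSSE = %s" % clusters.computeCost(parsedData))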

1.2)Decision tree

1)Place the irisnum.csv file from Downloads in your home directory.


2)Load the irisnum.csv data with: data = sc.textFile("irisnum.csv")
3)Import array from numpy with: from numpy import array
4)Prepare the transformation of the rows by separating the values and converting them to float
with:
pdata = data.map(lambda line: array([float(x) for x in line.split(',')]))
5)Display pdata with: pdata.collect()
6)Import the LabeledPoint function with:
from pyspark.mllib.regression import LabeledPoint.
7)Create a function called parse that labels a data line received as input (the last column holds
the class value) with:
def parse(l):
    return LabeledPoint(l[4], l[0:4])
8)Pass the lines one by one in order to label all the data with:
fdata = pdata.map(lambda l: parse(l))
9)Randomly divide the data in order to have a training base and a test base with:
(trainingData,testData) = fdata.randomSplit([0.8,0.2])
10)Import the function of decision trees with:
from pyspark.mllib.tree import DecisionTree
11)Prepare the model with:

model = DecisionTree.trainClassifier(trainingData, numClasses=3, categoricalFeaturesInfo={})
12)Perform the prediction for the test base with:
predictions = model.predict(testData.map(lambda r: r.features))
13)Build pairs of (prediction, actual value) for each test example with:
predictionAndLabels = predictions.zip(testData.map(lambda lp: lp.label))
14)Import MulticlassMetrics function for model evaluation with:
from pyspark.mllib.evaluation import MulticlassMetrics.
15)Start the evaluation function with:
metrics = MulticlassMetrics(predictionAndLabels)
16)Calculate the model precision with: precision = metrics.precision()
17)Calculate the model recall with: recall = metrics.recall()
18)Calculate the F1 score with: f1Score = metrics.fMeasure()
19)Display a title for the results with: print("Summary Stats")
20)Display the model precision with: print("Precision = %s" % precision)
21)Display the model recall with: print("Recall = %s" % recall)
22)Display the model F1 score with: print("F1 Score = %s" % f1Score)
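
To go further with the evaluation, MulticlassMetrics also exposes the confusion matrix and per-class metrics; a short sketch continuing from the previous steps (the label 0.0 is just one of the three classes):

# Rows are actual classes, columns are predicted classes
print(metrics.confusionMatrix().toArray())

# Per-class precision and recall, here for the class labelled 0.0
print("Precision(0.0) = %s" % metrics.precision(0.0))
print("Recall(0.0) = %s" % metrics.recall(0.0))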

2)DataFrames with Catalyst and Tungsten


1) Place the Iris1.csv file in the home directory of the VM.
2) Load this file with:
df = spark.read.load("Iris1.csv", format="csv", sep=",", inferSchema="true", header="true")
3) Here is another possibility to load the file with:
df1 = sqlContext.read.format('csv').options(header='true', inferSchema='true').load('Iris1.csv').
Note: inferSchema automatically infers the column types.
4) Count the number of lines of the data frame with: df.count()
5) Display the first 10 lines of the data frame with: df.show(10)
6) Filter and display the lines whose petal_lengths are strictly greater than 6 with:
df.filter(df["petal_length"]>6).show()
7) Count the "species" by group with: df.groupBy(df["species"]).count().show()
8) Display the first 10 lines with: df.head(10).
9) Select the different "species" using a sql query with:
df.registerTempTable("table")
distinct_classes = sqlContext.sql("select distinct species from table")
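
To display the result of the SQL query and check the schema inferred from the CSV file, the following calls can be added (a short follow-up sketch):

distinct_classes.show()   # the distinct values of the "species" column
df.printSchema()          # column names and the types inferred by inferSchema
df.describe().show()      # basic statistics: count, mean, stddev, min, max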


Decision Tree
1) Transform the data frame df by indexing the class variable "species" and creating a vector of
"features" with:
from pyspark.ml.feature import StringIndexer
speciesIndexer = StringIndexer(inputCol="species", outputCol="speciesIndex")
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols=["petal_width", "petal_length", "sepal_width",
"sepal_length"], outputCol="features")
data = vectorAssembler.transform(df)
index_model = speciesIndexer.fit(data)
data_indexed = index_model.transform(data)
2) Randomly divide the data into the learning base and the test base with:
trainingData, testData = data_indexed.randomSplit([0.8, 0.2], seed=0).
3) Import the decision trees function with: from pyspark.ml.classification import
DecisionTreeClassifier
4) Configure the model with:
dt = DecisionTreeClassifier().setLabelCol("speciesIndex").setFeaturesCol("features")
5) Start training with: model = dt.fit(trainingData).
6) Perform the classification of the test base with: classifications = model.transform(testData)
7) Perform the evaluation, as in the earlier workshop, with:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="speciesIndex", predictionCol="prediction",
metricName="accuracy")
accuracy = evaluator.evaluate(classifications)
print("Test set accuracy = " + str(accuracy))
