Cleaning the data, processing it as Resilient Distributed Datasets, and applying machine learning algorithms to make predictions and measure their accuracy.
FIT5202 Data Processing for Big Data
Assignment 2
Student Name : Pooja Vishal Pancholi
Student ID : 29984939
Tutorial Day and Time : Thursday 6 to 8 PM
Tutor Name: Huashun Li
A. Creating Spark Session and Loading the Data
Step 01: Import Spark Session and initialize Spark
In [1]:
# importing pyspark API Libraries
from pyspark import SparkContext, SparkConf # Spark
from pyspark.sql import SparkSession # Spark SQL
context = SparkContext.getOrCreate()
if (context is None):
    conf = SparkConf().setAppName("Assignment2 Application")\
                      .setMaster("local[4]")
    context = SparkContext(conf=conf)
spark = SparkSession(sparkContext=context)\
    .builder\
    .appName("Assignment2 Application")\
    .getOrCreate()
import numpy as np
import pyspark.sql.functions as funct
import pyspark.sql.types as typ
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
# for classification algorithms
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib
Step 02: Load the dataset and print the schema and total number of entries
In [2]:
# Creating the dataframe by reading csv
weatherDataStats = spark.read.csv('weatherAUS.csv',
                                  inferSchema=True, header=True)
# printing the number of records in the csv
print("The number of records in the CSV file are: ",
      weatherDataStats.count(), "records")
The number of records in the CSV file are: 142193 records
B. Data Cleaning and Processing
Step 03: Delete columns from the dataset
In [3]:
# To drop the required columns from the Dataframe
weatherDataStats = weatherDataStats.drop('Date',
                                         'Location',
                                         'Evaporation',
                                         'Sunshine',
                                         'Cloud9am',
                                         'Cloud3pm',
                                         'Temp9am',
                                         'Temp3pm')
# To check if the columns are actually deleted from the Dataframe
weatherDataStats.printSchema()
root
 |-- MinTemp: string (nullable = true)
 |-- MaxTemp: string (nullable = true)
 |-- Rainfall: string (nullable = true)
 |-- WindGustDir: string (nullable = true)
 |-- WindGustSpeed: string (nullable = true)
 |-- WindDir9am: string (nullable = true)
 |-- WindDir3pm: string (nullable = true)
 |-- WindSpeed9am: string (nullable = true)
 |-- WindSpeed3pm: string (nullable = true)
 |-- Humidity9am: string (nullable = true)
 |-- Humidity3pm: string (nullable = true)
 |-- Pressure9am: string (nullable = true)
 |-- Pressure3pm: string (nullable = true)
 |-- RainToday: string (nullable = true)
 |-- RainTomorrow: string (nullable = true)
Step 04: Print the number of missing data in each column.
In [4]:
#weatherDataStats.show(n=52)
# count the rows in each column that are NaN, null, or hold an 'NA'/'NULL' marker
nullCount = weatherDataStats.select([funct.count(funct.when(funct.isnan(i) | \
                                    funct.col(i).contains('NA') | \
                                    funct.col(i).contains('NULL') | \
                                    funct.col(i).isNull(), i)).alias(i) \
                                    for i in weatherDataStats.columns])
nullCount.show()
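+-------+-------+--------+-----------+-------------+----------+----------+------------+------------+-----------+-----------+-----------+-----------+---------+------------+
|MinTemp|MaxTemp|Rainfall|WindGustDir|WindGustSpeed|WindDir9am|WindDir3pm|WindSpeed9am|WindSpeed3pm|Humidity9am|Humidity3pm|Pressure9am|Pressure3pm|RainToday|RainTomorrow|
+-------+-------+--------+-----------+-------------+----------+----------+------------+------------+-----------+-----------+-----------+-----------+---------+------------+
|    637|    322|    1406|       9330|         9270|     10013|      3778|        1348|        2630|       1774|       3610|      14014|      13981|     1406|           0|
+-------+-------+--------+-----------+-------------+----------+----------+------------+------------+-----------+-----------+-----------+-----------+---------+------------+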
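The counting logic of Step 04 can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: the helper `count_missing`, the marker set and the sample rows are all assumptions made for illustration, and it matches markers exactly whereas the Spark code above uses a substring test (`contains`).

```python
# Plain-Python illustration of a per-column missing-value count.
# MISSING_MARKERS, count_missing and sample_rows are hypothetical names/data.
MISSING_MARKERS = {"NA", "NULL"}

def count_missing(rows, columns):
    """Return {column: number of rows whose value is None or a missing marker}."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            value = row.get(name)
            if value is None or value in MISSING_MARKERS:
                counts[name] += 1
    return counts

sample_rows = [
    {"MinTemp": "13.4", "WindGustDir": "W"},
    {"MinTemp": "NA", "WindGustDir": "NA"},
    {"MinTemp": "7.4", "WindGustDir": None},
]
missing = count_missing(sample_rows, ["MinTemp", "WindGustDir"])
print(missing)
```

Counting `None` and the text markers together mirrors the combined condition in the Spark expression, where any one of the tests is enough to treat a cell as missing.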
Step 05: Fill the missing data with average value and maximum occurrence value.
In [5]:
columns = ["MinTemp", "MaxTemp", "Rainfall", "Pressure9am",
           "Pressure3pm", "WindGustSpeed", "WindSpeed9am",
           "WindSpeed3pm", "Humidity9am", "Humidity3pm"]
# replacing all 'NA' markers with nulls
weatherDataStats = weatherDataStats.replace('NA', None)
# to calculate the exact mean without null values
removeNulls = weatherDataStats.na.drop()
# in my case, I have filtered out the null values
# to calculate the mean so that it does not affect the overall mean
for i in columns:
    # cast the numeric column from string to double
    weatherDataStats = weatherDataStats.withColumn(i, weatherDataStats[i].cast(typ.DoubleType()))
    meanValue = removeNulls.agg(funct.avg(i)).first()[0]
    # weatherDataStats = weatherDataStats.na.fill(meanValue, [i])
    weatherDataStats = weatherDataStats.\
        withColumn(i,\
                   funct.when\
                   (funct.isnull(funct.col(i)),\
                    round(meanValue, 2)).otherwise(weatherDataStats[i]))
# for the columns having string values
columnsString = ["WindGustDir", "WindDir9am", "WindDir3pm", "RainToday", "RainTomorrow"]
for i in columnsString:
    # the first row after sorting by count holds the most frequent value
    count = weatherDataStats.groupBy(weatherDataStats[i]).count().sort("count", ascending=False).first()
    weatherDataStats = weatherDataStats.\
        withColumn(i,\
                   funct.when\
                   (funct.isnull(funct.col(i)),\
                    count[i]).otherwise(weatherDataStats[i]))
# weatherDataStats.show(10)
# weatherDataStats.printSchema()
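The two filling rules of Step 05 (mean for numeric columns, most frequent value for string columns) can be sketched in plain Python. This is an illustration under stated assumptions rather than the notebook's code: `fill_with_mean` and `fill_with_mode` are hypothetical helpers, the sample values are made up, and the mean here uses every non-missing value of the column itself, whereas the notebook computes it from rows that have no missing value in any column.

```python
from collections import Counter
from statistics import mean

def fill_with_mean(values, ndigits=2):
    """Replace None entries with the rounded mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    fill_value = round(mean(present), ndigits)
    return [fill_value if v is None else v for v in values]

def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing entry."""
    present = [v for v in values if v is not None]
    fill_value = Counter(present).most_common(1)[0][0]
    return [fill_value if v is None else v for v in values]

rainfall = [0.6, None, 0.0, 1.0]        # made-up sample values
wind_dir = ["W", "WNW", None, "W"]      # made-up sample values
filled_rainfall = fill_with_mean(rainfall)
filled_wind_dir = fill_with_mode(wind_dir)
print(filled_rainfall, filled_wind_dir)
```

Excluding the missing entries before averaging is the point of the `removeNulls` step above: a placeholder value would otherwise distort the mean that is used as the fill value.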
Step 06: Data transformation
In [6]:
columns_list = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
# using the string indexer
for i in columns_list:
    l_indexer = StringIndexer(inputCol=i, outputCol=i + "_new")
    weatherDataStats = l_indexer.fit(weatherDataStats).transform(weatherDataStats)
    # drop the original string column once it has been indexed
    weatherDataStats = weatherDataStats.drop(i)
weatherDataStats.printSchema()
root
 |-- MinTemp: double (nullable = true)
 |-- MaxTemp: double (nullable = true)
 |-- Rainfall: double (nullable = true)
 |-- WindGustSpeed: double (nullable = true)
 |-- WindSpeed9am: double (nullable = true)
 |-- WindSpeed3pm: double (nullable = true)
 |-- Humidity9am: double (nullable = true)
 |-- Humidity3pm: double (nullable = true)
 |-- Pressure9am: double (nullable = true)
 |-- Pressure3pm: double (nullable = true)
 |-- WindGustDir_new: double (nullable = false)
 |-- WindDir9am_new: double (nullable = false)
 |-- WindDir3pm_new: double (nullable = false)
 |-- RainToday_new: double (nullable = false)
 |-- RainTomorrow_new: double (nullable = false)
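By default, `StringIndexer` assigns index 0.0 to the most frequent label, 1.0 to the next, and so on. The plain-Python sketch below imitates that ordering to show what the `_new` columns contain. The helper `fit_frequency_index` and the sample values are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_frequency_index(values):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}

rain_today = ["No", "No", "Yes", "No", "Yes"]   # made-up sample values
mapping = fit_frequency_index(rain_today)
encoded = [mapping[value] for value in rain_today]
print(mapping, encoded)
```

Because the index follows frequency, the majority class of `RainTomorrow` ends up as 0.0, which matters later when the confusion matrix treats class 0 as the "positive" class.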
Step 07: Create the feature vector and divide the dataset
In [7]:
# using the vector assembler
columnsPredict = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
                  'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'WindGustDir_new',
                  'WindDir9am_new', 'WindDir3pm_new', 'RainToday_new']
vector_assembler = VectorAssembler(
    inputCols=columnsPredict,
    outputCol="features")
# transforming the data
temporaryData = vector_assembler.transform(weatherDataStats)
temporaryData.show(3)
# keep only the label and the assembled feature vector
for i in columnsPredict:
    temporaryData = temporaryData.drop(i)
# adding it to the prediction dataframe
predictionData = temporaryData
predictionData.show(5)
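+-------+-------+--------+-------------+------------+------------+-----------+-----------+-----------+-----------+---------------+--------------+--------------+-------------+----------------+--------------------+
|MinTemp|MaxTemp|Rainfall|WindGustSpeed|WindSpeed9am|WindSpeed3pm|Humidity9am|Humidity3pm|Pressure9am|Pressure3pm|WindGustDir_new|WindDir9am_new|WindDir3pm_new|RainToday_new|RainTomorrow_new|            features|
+-------+-------+--------+-------------+------------+------------+-----------+-----------+-----------+-----------+---------------+--------------+--------------+-------------+----------------+--------------------+
|   13.4|   22.9|     0.6|         44.0|        20.0|        24.0|       71.0|       22.0|     1007.7|     1007.1|            2.0|           6.0|           7.0|          0.0|             0.0|[13.4,22.9,0.6,44...|
|    7.4|   25.1|     0.0|         44.0|         4.0|        22.0|       44.0|       25.0|     1010.6|     1007.8|            9.0|           9.0|           3.0|          0.0|             0.0|[7.4,25.1,0.0,44....|
|   12.9|   25.7|     0.0|         46.0|        19.0|        26.0|       38.0|       30.0|     1007.6|     1008.7|            6.0|           6.0|           3.0|          0.0|             0.0|[12.9,25.7,0.0,46...|
+-------+-------+--------+-------------+------------+------------+-----------+-----------+-----------+-----------+---------------+--------------+--------------+-------------+----------------+--------------------+
only showing top 3 rows
+----------------+--------------------+
|RainTomorrow_new|            features|
+----------------+--------------------+
|             0.0|[13.4,22.9,0.6,44...|
|             0.0|[7.4,25.1,0.0,44....|
|             0.0|[12.9,25.7,0.0,46...|
|             0.0|[9.2,28.0,0.0,24....|
|             0.0|[17.5,32.3,1.0,42...|
+----------------+--------------------+
only showing top 5 rows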
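Conceptually, the assembler step concatenates the chosen input columns of each row into one numeric vector, in the order given. A minimal plain-Python sketch of that idea follows; `assemble_features` is a hypothetical helper, and the row reuses three values from the first record shown above.

```python
def assemble_features(row, input_cols):
    """Collect the named columns of one row into a single list of floats."""
    return [float(row[name]) for name in input_cols]

# One row with a subset of the columns, for illustration only.
row = {"MinTemp": 13.4, "MaxTemp": 22.9, "Rainfall": 0.6, "RainTomorrow_new": 0.0}
features = assemble_features(row, ["MinTemp", "MaxTemp", "Rainfall"])
print(features)
```

The label column is deliberately left out of the input list, so the model never sees the value it is supposed to predict.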
In [8]:
(trainingData, testData) = predictionData.randomSplit([0.7, 0.3])
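The split above draws rows at random, so the accuracies reported below would change slightly from run to run. `randomSplit` accepts an optional `seed` argument that makes the split repeatable. The plain-Python sketch below shows the idea of a seeded 70/30 split. `random_split` is a hypothetical helper and the seed value is arbitrary; it is not a reproduction of Spark's sampling.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with a seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))                 # stand-in for the dataset rows
train_a, test_a = random_split(rows)
train_b, test_b = random_split(rows)     # same seed, so the same split
print(len(train_a), len(test_a))
```

Each row is assigned independently, so the resulting sizes are only approximately 70% and 30% of the data.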
C. Apply Machine Learning Algorithms
Step 08: Apply machine learning classification algorithms on the dataset and compare their accuracy. Plot the accuracy as a bar graph.
In [9]:
# Predict the data
lr = LogisticRegression(featuresCol='features', labelCol='RainTomorrow_new')
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
# Evaluate the accuracy
evaluator =\
    MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
    predictionCol="prediction", metricName="accuracy")
# printing the accuracy
accuracy = evaluator.evaluate(predictions)
print("Accuracy: ", accuracy)
Accuracy: 0.814272234017457
Decision Tree Classifier Implementation
In [10]:
# Predict the data
decisionTree = DecisionTreeClassifier(featuresCol='features', labelCol='RainTomorrow_new')
decisionTreeModel = decisionTree.fit(trainingData)
predictionsDt = decisionTreeModel.transform(testData)
# predictionsDt.show()
# Evaluate the accuracy
evaluator = MulticlassClassificationEvaluator(\
    labelCol="RainTomorrow_new", predictionCol="prediction",\
    metricName="accuracy")
# Calculating and printing the accuracy
accuracyDt = evaluator.evaluate(predictionsDt)
print("Accuracy: ", accuracyDt)
Accuracy: 0.8234253361641897
Random forest algorithm
In [11]:
# Predict the data
randomForest = RandomForestClassifier(labelCol="RainTomorrow_new",\
                                      featuresCol="features", numTrees=10)
randomForestModel = randomForest.fit(trainingData)
predictionsRF = randomForestModel.transform(testData)
predictionsRF.select("prediction", "RainTomorrow_new").show(5)
# Evaluate the accuracy
evaluator =\
    MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
    predictionCol="prediction", metricName="accuracy")
# Calculating and printing the accuracy of the random forest predictions
accuracyrf = evaluator.evaluate(predictionsRF)
print("Accuracy: ", accuracyrf)
|prediction|RainTomorrow_new|
only showing top 5 rows
Accuracy: 0.8265628686010852
Implementing the Gradient-Boosted Tree Classifier
In [12]:
# Predict the data
gbtc = GBTClassifier(maxIter=12, labelCol="RainTomorrow_new")
gbtModel = gbtc.fit(trainingData)
predictionsGbt = gbtModel.transform(testData)
# Evaluate the accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
    predictionCol="prediction", metricName="accuracy")
# Calculating and printing the accuracy
accuracygbt = evaluator.evaluate(predictionsGbt)
print("Accuracy: ", accuracygbt)
Accuracy: 0.8358339230950695
In [13]:
# setting the index
index = np.arange(1)
bar_width = 0.10
pylab.rcParams['figure.figsize'] = (15, 4) # setting the figure size
fig, ax = plt.subplots() # plotting the subplot
# Setting the Label and Title formatting
ax.set_xlabel('Accuracy', size=14)
ax.set_ylabel('Model', size=14)
ax.set_title('Bar chart to compare accuracy', size=18)
# plotting one bar for each model
data = ax.bar(index, accuracyDt, bar_width, label="Decision tree Classification")
data = ax.bar(index + bar_width + 2, accuracyrf, bar_width, label="Random Forest Classification")
data = ax.bar(index + bar_width + 4, accuracy, bar_width, label="Logistic Regression Classification")
data = ax.bar(index + bar_width + 6, accuracygbt, bar_width, label="Gradient-Boosted Tree Classification")
# Show the plot
ax.legend()
plt.show()
[Figure: bar chart titled "Bar chart to compare accuracy", one bar per classifier, x-axis label "Accuracy"]
Step 09: Calculate the confusion matrix and find the precision, recall, and F1 score of each classification algorithm. Explain how the accuracy of the prediction can be improved.
In [14]:
# creates a list of all prediction datasets
listDatasets = [predictionsDt, predictionsRF, predictions, predictionsGbt]
# Calculate statistics for the given datasets
for predictedValues in [0, 1, 2, 3]:
    if predictedValues == 0:
        print("Stats for the Decision Tree Classifier:")
    if predictedValues == 1:
        print("\n\nStats for the Random Forest Classifier:")
    if predictedValues == 2:
        print("\n\nStats for the Logistic Regression Classifier:")
    if predictedValues == 3:
        print("\n\nStats for the GBT Classifier:")
    # calculate metrics
    predictionAndLabels = listDatasets[predictedValues].select('RainTomorrow_new', 'prediction')
    metrics = MulticlassMetrics(predictionAndLabels.rdd.map(tuple))
    # get confusion matrix (class 0 is treated as the positive class)
    confusionMatrix = metrics.confusionMatrix().toArray()
    TP = confusionMatrix[0][0]
    FP = confusionMatrix[0][1]
    TN = confusionMatrix[1][1]
    FN = confusionMatrix[1][0]
    # print confusion matrix
    print("Confusion Matrix:\n")
    print("\t\t Actual Positive\tActual Negative")
    print("Predicted Positive\t" + str(TP) + "\t" + str(FP))
    print("Predicted Negative\t" + str(FN) + "\t" + str(TN))
    # calculate and print precision
    precision = TP/(TP+FP)
    print("\nprecision: " + str(precision))
    # calculate and print recall
    recall = TP/(TP+FN)
    print("recall: " + str(recall))
    # calculate f1 score (harmonic mean of precision and recall)
    f1 = 2*precision*recall/(precision+recall)
    # print accuracy
    print("accuracy: " + str(metrics.accuracy))
Stats for the Decision Tree Classifier:
Confusion Matrix:

                     Actual Positive   Actual Negative
Predicted Positive   31549.0           6293.0
Predicted Negative   1192.0            3356.0

precision: 0.8337032926378098
recall: 0.9635930484713356
accuracy: 0.8234253361641897


Stats for the Random Forest Classifier:
Confusion Matrix:

                     Actual Positive   Actual Negative
Predicted Positive   31778.0           6389.0
Predicted Negative   963.0             3260.0

precision: 0.8326040820604187
recall: 0.9705873369781008
accuracy: 0.8265628686010852


Stats for the Logistic Regression Classifier:
Confusion Matrix:

                     Actual Positive   Actual Negative
Predicted Positive   30492.0           5624.0
Predicted Negative   2249.0            4025.0

precision: 0.8442795436925462
recall: 0.9313093674597599
accuracy: 0.814272234017457


Stats for the GBT Classifier:
Confusion Matrix:

                     Actual Positive   Actual Negative
Predicted Positive   31067.0           5285.0
Predicted Negative   1674.0            4364.0

precision: 0.854615977112676
recall: 0.9488714455881005
accuracy: 0.8358339230950695
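The reported figures can be checked by hand from the four confusion-matrix counts. The sketch below is a plain-Python illustration using the Decision Tree counts printed above; `classification_metrics` is a hypothetical helper, and the F1 score is the standard harmonic mean 2PR/(P+R).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Decision Tree counts from the output above.
dt = classification_metrics(tp=31549, fp=6293, fn=1192, tn=3356)
for name, value in dt.items():
    print(name, value)
```

Precision, recall and accuracy computed this way agree with the Decision Tree values in the output, and the same function can be applied to the counts of the other three classifiers to obtain their F1 scores.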
My Evaluation
The accuracy of the predictions can be improved by adding more, and more accurate, data. Feature engineering can also improve accuracy. For instance, binning numeric columns, creating new features, or selecting specific features based on domain knowledge could make the predictions more accurate and the models more efficient.
In the case of the Decision Tree classifier: The maximum number of bins can be tuned by changing the maxBins parameter to a value that results in higher accuracy. The maximum depth of the tree (maxDepth) can likewise be adjusted to a value that gives higher accuracy and efficiency.
In the case of the Random Forest classifier: The algorithm can be tuned by increasing the number of trees (numTrees in Spark ML, n_estimators in scikit-learn) or the number of features considered at each split (featureSubsetStrategy in Spark ML, max_features in scikit-learn). Either change can increase accuracy, at the cost of longer training time.
In the case of the Logistic Regression classifier: Accuracy can be improved by performing hyperparameter tuning using grid search. Cross validation can also be used when training the model, in addition to adjusting the decision threshold.
In the case of the GBT classifier: Cross validation can be performed when training the model, and accuracy can be improved further by assigning weights to various features, by tuning the maximum depth, and by tuning the number of bins.
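The tuning approach mentioned above, grid search combined with cross validation, can be sketched independently of Spark (in Spark ML the corresponding tools are `CrossValidator` and `ParamGridBuilder` in `pyspark.ml.tuning`). The example below is a toy illustration only: a one-feature k-nearest-neighbour classifier stands in for the real models, and all helper names and data are made up for this sketch.

```python
from collections import Counter

def k_fold_indices(n_rows, n_folds):
    """Yield (train, validation) index lists; fold f takes every n_folds-th row."""
    for fold in range(n_folds):
        validation = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, validation

def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points closest to x (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cross_validated_accuracy(xs, ys, k, n_folds=3):
    """Mean validation accuracy of k-NN over the folds."""
    fold_scores = []
    for train, validation in k_fold_indices(len(xs), n_folds):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i]
                   for i in validation)
        fold_scores.append(hits / len(validation))
    return sum(fold_scores) / n_folds

def grid_search(xs, ys, candidates, n_folds=3):
    """Return (best_k, scores), where scores maps each candidate to its CV accuracy."""
    scores = {k: cross_validated_accuracy(xs, ys, k, n_folds) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Made-up toy data: one humidity-like feature, label 1 = rain tomorrow.
xs = [20, 25, 30, 35, 40, 45, 60, 65, 70, 75, 80, 85]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
best_k, scores = grid_search(xs, ys, candidates=[1, 3, 5])
print(best_k, scores)
```

Each candidate setting is scored only on rows it was not trained on, so the chosen value is less likely to be one that merely memorised the training data.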
End of the Assignment. I hope you like my work :)