FIT5202 Data Processing for Big Data - Assignment 2

Student Name: Pooja Vishal Pancholi
Student ID: 29984939
Tutorial Day and Time: Thursday 6 to 8 PM
Tutor Name: Huashun Li

A. Creating Spark Session and Loading the Data

Step 01: Import Spark Session and initialize Spark

In [1]:
# importing PySpark API libraries
from pyspark import SparkContext, SparkConf     # Spark
from pyspark.sql import SparkSession            # Spark SQL

context = SparkContext.getOrCreate()
if context is None:
    conf = SparkConf().setAppName("Assignment2 Application")\
                      .setMaster("local[4]")
    context = SparkContext(conf=conf)

spark = SparkSession.builder\
    .appName("Assignment2 Application")\
    .getOrCreate()

In [2]:
import numpy as np
import pyspark.sql.functions as funct
import pyspark.sql.types as typ
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors

# for feature preparation and classification algorithms
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

import matplotlib.pyplot as plt
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib

Step 02: Load the dataset and print the schema and total number of entries.

# creating the DataFrame by reading the CSV file
weatherDataStats = spark.read.csv('weatherAUS.csv', inferSchema=True, header=True)

# printing the number of records in the CSV
print("The number of records in the CSV file are: ", weatherDataStats.count(), "records")

The number of records in the CSV file are:  142193 records

B. Data Cleaning and Processing

Step 03: Delete columns from the dataset

In [3]:
# to drop the required columns from the DataFrame
weatherDataStats = weatherDataStats.drop('Date', 'Location', 'Evaporation', 'Sunshine',
                                         'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm')

# to check if the columns are actually deleted from the DataFrame
weatherDataStats.printSchema()

root
 |-- MinTemp: string (nullable = true)
 |-- MaxTemp: string (nullable = true)
 |-- Rainfall: string (nullable = true)
 |-- WindGustDir: string (nullable = true)
 |-- WindGustSpeed: string (nullable = true)
 |-- WindDir9am: string (nullable = true)
 |-- WindDir3pm: string (nullable = true)
 |-- WindSpeed9am: string (nullable = true)
 |-- WindSpeed3pm: string (nullable = true)
 |-- Humidity9am: string (nullable = true)
 |-- Humidity3pm: string (nullable = true)
 |-- Pressure9am: string (nullable = true)
 |-- Pressure3pm: string (nullable = true)
 |-- RainToday: string (nullable = true)
 |-- RainTomorrow: string (nullable = true)
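A note on the schema above: every column has come back as string even though inferSchema=True, because the 'NA' placeholders in the file prevent numeric type inference; this is why the numeric columns are cast to double in Step 05. As an alternative (not part of the original notebook), the column types and the null marker could be declared at read time instead. A minimal sketch, showing three of the file's columns; weatherSchema and weatherExplicit are illustrative names:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# declaring types up front avoids a second pass over the file for schema
# inference, and nullValue='NA' turns the 'NA' markers into real nulls
weatherSchema = StructType([
    StructField("MinTemp", DoubleType(), True),
    StructField("MaxTemp", DoubleType(), True),
    StructField("WindGustDir", StringType(), True),
    # ... one StructField per remaining column of the file ...
])

weatherExplicit = spark.read.csv('weatherAUS.csv', schema=weatherSchema,
                                 header=True, nullValue='NA')

Loading this way would also make the replace('NA', None) call in Step 05 unnecessary.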
Step 04: Print the number of missing data in each column.

In [4]:
# weatherDataStats.show(n=52)
nullCount = weatherDataStats.select([
    funct.count(funct.when(funct.isnan(i) | \
                           funct.col(i).contains('NA') | \
                           funct.col(i).contains('NULL') | \
                           funct.col(i).isnull(), i)).alias(i) \
    for i in weatherDataStats.columns])
nullCount.show()

|MinTemp|MaxTemp|Rainfall|WindGustDir|WindGustSpeed|WindDir9am|WindDir3pm|WindSpeed9am|WindSpeed3pm|Humidity9am|Humidity3pm|Pressure9am|Pressure3pm|RainToday|RainTomorrow|
|    637|    322|    1406|       9330|         9270|     10013|      3778|        1348|        2630|       1774|       3610|      14014|      13981|     1406|           0|

Step 05: Fill the missing data with average value and maximum occurrence value.

In [5]:
columns = ["MinTemp", "MaxTemp", "Rainfall", "WindGustSpeed", "WindSpeed9am",
           "WindSpeed3pm", "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm"]

# replacing all 'NA' markers with real nulls
weatherDataStats = weatherDataStats.replace('NA', None)

# in my case, I have filtered out the null values to calculate the mean
# so that they do not affect the overall mean
removeNulls = weatherDataStats.na.drop()

for i in columns:
    weatherDataStats = weatherDataStats.withColumn(i, weatherDataStats[i].cast(typ.DoubleType()))
    meanValue = removeNulls.agg(funct.avg(i)).first()[0]
    # weatherDataStats = weatherDataStats.na.fill(meanValue, [i])
    weatherDataStats = weatherDataStats.\
        withColumn(i,\
                   funct.when(funct.isnull(funct.col(i)),\
                              round(meanValue, 2)).otherwise(weatherDataStats[i]))

# for the columns having string values
columnsString = ["WindGustDir", "WindDir9am", "WindDir3pm", "RainToday", "RainTomorrow"]
for i in columnsString:
    # the row holding the most frequent value of the column
    count = weatherDataStats.groupBy(weatherDataStats[i]).count().sort("count", ascending=False).first()
    weatherDataStats = weatherDataStats.\
        withColumn(i,\
                   funct.when(funct.isnull(funct.col(i)),\
                              count[i]).otherwise(weatherDataStats[i]))

# weatherDataStats.show(16)
# weatherDataStats.printSchema()

Step 06: Data transformation

In [6]:
columns_list = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

# using the StringIndexer
for i in columns_list:
    l_indexer = StringIndexer(inputCol=i, outputCol=i + "_new")
    weatherDataStats = l_indexer.fit(weatherDataStats).transform(weatherDataStats)
    weatherDataStats = weatherDataStats.drop(i)

weatherDataStats.printSchema()

root
 |-- MinTemp: double (nullable = true)
 |-- MaxTemp: double (nullable = true)
 |-- Rainfall: double (nullable = true)
 |-- WindGustSpeed: double (nullable = true)
 |-- WindSpeed9am: double (nullable = true)
 |-- WindSpeed3pm: double (nullable = true)
 |-- Humidity9am: double (nullable = true)
 |-- Humidity3pm: double (nullable = true)
 |-- Pressure9am: double (nullable = true)
 |-- Pressure3pm: double (nullable = true)
 |-- WindGustDir_new: double (nullable = false)
 |-- WindDir9am_new: double (nullable = false)
 |-- WindDir3pm_new: double (nullable = false)
 |-- RainToday_new: double (nullable = false)
 |-- RainTomorrow_new: double (nullable = false)
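The loop above fits and applies each StringIndexer separately. Since Pipeline is already imported in Step 01, the five indexers could instead be chained and fitted in one pass. A minimal sketch of that alternative (not part of the original notebook; it assumes the original string columns are still present, i.e. it would replace the loop rather than follow it):

# one indexer per categorical column, chained in a single Pipeline
indexers = [StringIndexer(inputCol=i, outputCol=i + "_new") for i in columns_list]
pipeline = Pipeline(stages=indexers)
weatherDataStats = pipeline.fit(weatherDataStats).transform(weatherDataStats)

# drop the original string columns after indexing
weatherDataStats = weatherDataStats.drop(*columns_list)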
Step 07: Create the feature vector and divide the dataset

In [7]:
# using the VectorAssembler
columnsPredict = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am',
                  'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm',
                  'WindGustDir_new', 'WindDir9am_new', 'WindDir3pm_new', 'RainToday_new']

vector_assembler = VectorAssembler(\
    inputCols=columnsPredict,\
    outputCol="features")

# transforming the data
temporaryData = vector_assembler.transform(weatherDataStats)
temporaryData.show(3)

for i in columnsPredict:
    temporaryData = temporaryData.drop(i)

# adding it to the prediction dataframe
predictionData = temporaryData
predictionData.show(5)

|MinTemp|MaxTemp|Rainfall|WindGustSpeed|WindSpeed9am|WindSpeed3pm|Humidity9am|Humidity3pm|Pressure9am|Pressure3pm|WindGustDir_new|WindDir9am_new|WindDir3pm_new|RainToday_new|RainTomorrow_new|            features|
|   13.4|   22.9|     0.6|         44.0|        20.0|        24.0|       71.0|       22.0|     1007.7|     1007.1|            2.0|           6.0|           7.0|          0.0|             0.0|[13.4,22.9,0.6,44...|
|    7.4|   25.1|     0.0|         44.0|         4.0|        22.0|       44.0|       25.0|     1010.6|     1007.8|            9.0|           9.0|           3.0|          0.0|             0.0|[7.4,25.1,0.0,44,...|
|   12.9|   25.7|     0.0|         46.0|        19.0|        26.0|       38.0|       30.0|     1007.6|     1008.7|            6.0|           6.0|           3.0|          0.0|             0.0|[12.9,25.7,0.0,46...|
only showing top 3 rows

|RainTomorrow_new|            features|
|             0.0|[13.4,22.9,0.6,44...|
|             0.0|[7.4,25.1,0.0,44,...|
|             0.0|[12.9,25.7,0.0,46...|
|             0.0|[9.2,28.0,0.0,24,...|
|             0.0|[17.5,32.3,1.0,41...|
only showing top 5 rows

In [8]:
(trainingData, testData) = predictionData.randomSplit([0.7, 0.3])
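randomSplit samples randomly, so each run of the notebook produces a slightly different 70/30 split and therefore slightly different accuracies in Step 08. Passing a seed (a standard randomSplit parameter; 42 below is an arbitrary choice) would make the split, and hence the reported metrics, reproducible:

# reproducible split: the same seed always yields the same partition
(trainingData, testData) = predictionData.randomSplit([0.7, 0.3], seed=42)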
C. Apply Machine Learning Algorithms

Step 08: Apply machine learning classification algorithms on the dataset and compare their accuracy. Plot the accuracy as a bar graph.

Logistic Regression Implementation

In [9]:
# Predict the data
lr = LogisticRegression(featuresCol='features', labelCol='RainTomorrow_new')
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)

# Evaluate the accuracy
evaluator =\
    MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
                                      predictionCol="prediction", metricName="accuracy")

# printing the accuracy
accuracy = evaluator.evaluate(predictions)
print("Accuracy: ", accuracy)

Accuracy: 0.814272234017457

Decision Tree Classifier Implementation

In [10]:
# Predict the data
decisionTree = DecisionTreeClassifier(featuresCol='features', labelCol='RainTomorrow_new')
decisionTreeModel = decisionTree.fit(trainingData)
predictionsDt = decisionTreeModel.transform(testData)
# predictionsDt.show()

# Evaluate the accuracy
evaluator = MulticlassClassificationEvaluator(\
    labelCol="RainTomorrow_new", predictionCol="prediction",\
    metricName="accuracy")

# Calculating and printing the accuracy
accuracyDt = evaluator.evaluate(predictionsDt)
print("Accuracy: ", accuracyDt)

Accuracy: 0.8234253361641897

Random Forest Classifier Implementation

In [11]:
# Predict the data
randomForest = RandomForestClassifier(labelCol="RainTomorrow_new",\
                                      featuresCol="features", numTrees=10)
randomForestModel = randomForest.fit(trainingData)
predictionsRF = randomForestModel.transform(testData)
predictionsRF.select("prediction", "RainTomorrow_new").show(5)

# Evaluate the accuracy
evaluator =\
    MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
                                      predictionCol="prediction", metricName="accuracy")

# Calculating and printing the accuracy
accuracyrf = evaluator.evaluate(predictionsRF)
print("Accuracy: ", accuracyrf)

|prediction|RainTomorrow_new|
only showing top 5 rows

Accuracy: 0.8265628686010852

Implementing the Gradient-Boosted Tree Classifier

In [12]:
# Predict the data
gbtc = GBTClassifier(maxIter=10, labelCol="RainTomorrow_new")
gbtModel = gbtc.fit(trainingData)
predictionsGbt = gbtModel.transform(testData)

# Evaluate the accuracy
evaluator = MulticlassClassificationEvaluator(labelCol="RainTomorrow_new",\
                                              predictionCol="prediction", metricName="accuracy")

# Calculating and printing the accuracy
accuracygbt = evaluator.evaluate(predictionsGbt)
print("Accuracy: ", accuracygbt)

Accuracy: 0.8358339230950695

In [13]:
# setting the index
index = np.arange(1)
bar_width = 0.10
pylab.rcParams['figure.figsize'] = (15, 4)  # setting the figure size

fig, ax = plt.subplots()  # plotting the subplot

# setting the label and title formatting
ax.set_xlabel('Model', size=14)
ax.set_ylabel('Accuracy', size=14)
ax.set_title('Bar chart to compare accuracy', size=18)

# plotting the bars for each classifier
data = ax.bar(index, accuracyDt, bar_width, label="Decision Tree Classification")
data = ax.bar(index + bar_width * 2, accuracyrf, bar_width, label="Random Forest Classification")
data = ax.bar(index + bar_width * 4, accuracy, bar_width, label="Logistic Regression Classification")
data = ax.bar(index + bar_width * 6, accuracygbt, bar_width, label="Gradient-Boosted Tree Classification")

# Show the plot
ax.legend()
plt.show()

[Figure: "Bar chart to compare accuracy" - one bar per classifier]
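The gradient-boosted tree model comes out on top here. Its accuracy could likely be pushed further with a systematic hyperparameter search; the following is a minimal sketch (not part of the original notebook) using Spark's CrossValidator and ParamGridBuilder on the GBT classifier, reusing gbtc, evaluator, trainingData, and testData from above. The grid values are arbitrary examples:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# a small grid: every extra value multiplies the number of models trained
paramGrid = ParamGridBuilder()\
    .addGrid(gbtc.maxDepth, [3, 5])\
    .addGrid(gbtc.maxBins, [16, 32])\
    .build()

crossValidator = CrossValidator(estimator=gbtc,
                                estimatorParamMaps=paramGrid,
                                evaluator=evaluator,   # the accuracy evaluator from Step 08
                                numFolds=3)

# 3 folds x 4 grid points -> 12 fits, plus a final refit of the best setting
cvModel = crossValidator.fit(trainingData)
predictionsCv = cvModel.transform(testData)   # transform() uses the best model found
print("Tuned accuracy: ", evaluator.evaluate(predictionsCv))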
Step 09: Calculate the confusion matrix and find the precision, recall, and F1 score of each classification algorithm. Explain how the accuracy of the prediction can be improved.

In [14]:
# creates a list of all prediction datasets
listDatasets = [predictionsDt, predictionsRF, predictions, predictionsGbt]

# calculate statistics for the given datasets
for predictedValues in [0, 1, 2, 3]:
    if predictedValues == 0:
        print("Stats for the Decision Tree Classifier:")
    if predictedValues == 1:
        print("\n\nStats for the Random Forest Classifier:")
    if predictedValues == 2:
        print("\n\nStats for the Logistic Regression Classifier:")
    if predictedValues == 3:
        print("\n\nStats for the GBT Classifier:")

    # calculate metrics
    predictionAndLabels = listDatasets[predictedValues].select('RainTomorrow_new', 'prediction')
    metrics = MulticlassMetrics(predictionAndLabels.rdd.map(tuple))

    # get confusion matrix
    confusionMatrix = metrics.confusionMatrix().toArray()
    TP = confusionMatrix[0][0]
    FP = confusionMatrix[0][1]
    TN = confusionMatrix[1][1]
    FN = confusionMatrix[1][0]

    # print confusion matrix
    print("Confusion Matrix:\n")
    print("\t\t Actual Positive\tActual Negative")
    print("Predicted Positive " + str(TP) + "\t" + str(FP))
    print("Predicted Negative " + str(FN) + "\t" + str(TN))

    # calculate and print precision
    precision = TP/(TP+FP)
    print("\nprecision: " + str(precision))

    # calculate and print recall
    recall = TP/(TP+FN)
    print("recall: " + str(recall))

    # calculate the F1 score
    f1 = 2 * precision * recall / (precision + recall)

    # print accuracy
    print("accuracy: " + str(metrics.accuracy))

Stats for the Decision Tree Classifier:
Confusion Matrix:

                    Actual Positive    Actual Negative
Predicted Positive  31549.0            6293.0
Predicted Negative  1192.0             3356.0

precision: 0.8337032926378098
recall: 0.9635930484713356
accuracy: 0.8234253361641897


Stats for the Random Forest Classifier:
Confusion Matrix:

                    Actual Positive    Actual Negative
Predicted Positive  31778.0            6389.0
Predicted Negative  963.0              3260.0

precision: 0.8326040820604187
recall: 0.9705873369781008
accuracy: 0.8265628686010852


Stats for the Logistic Regression Classifier:
Confusion Matrix:

                    Actual Positive    Actual Negative
Predicted Positive  30492.0            5624.0
Predicted Negative  2249.0             4025.0

precision: 0.8442795436925462
recall: 0.9313093674597599
accuracy: 0.814272234017457


Stats for the GBT Classifier:
Confusion Matrix:

                    Actual Positive    Actual Negative
Predicted Positive  31067.0            5285.0
Predicted Negative  1674.0             4364.0

precision: 0.854615977112676
recall: 0.9488714455881005
accuracy: 0.8358339230950695

My Evaluation

The accuracy of the predictions can be improved by adding more, and more accurate, data. Feature engineering can also be performed to improve accuracy: for example, binning numeric data, creating new features, or selecting particular features based on domain knowledge could make the predictions more accurate and the models more efficient.

In the case of the decision tree classifier: the maximum number of bins can be tuned by adjusting the maxBins parameter to a value that results in higher accuracy. The maximum depth of the tree (maxDepth) can likewise be adjusted to a value that gives higher accuracy and efficiency.

In the case of the random forest classifier: the algorithm can be tuned by increasing the number of trees (numTrees) or by adjusting how many features are considered at each split (featureSubsetStrategy). Either can increase the accuracy of the model.

In the case of the logistic regression classifier: the accuracy of this model can be improved by performing hyperparameter tuning using grid search (see the CrossValidator sketch at the end of Step 08). Cross-validation can also be used when training the model, in addition to adjusting the classification threshold.

In the case of the GBT classifier: cross-validation can be performed to train the model before making predictions. Accuracy can also be improved by assigning weights to various features, by tuning the maximum depth, and by tuning the number of bins.

End of the Assignment. I hope you like my work :)
