FIT5202 Data Processing for Big Data
Assignment 1 - Part B

Student Name: Pooja Vishal Pancholi
Student ID: 29984939
Tutorial Day and Time: Thursday 6 to 8 PM
Tutor Name: Huashun Li

Step 01: Import pyspark and initialize Spark

In [1]:
    # for mongodb
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-con…

    # importing pyspark API libraries
    from pyspark import SparkContext, SparkConf  # Spark
    from pyspark.sql import SparkSession  # Spark SQL

    context = SparkContext.getOrCreate()
    if (context is None):
        conf = SparkConf().setAppName("Assignment1B Application").setMaster("local[*]")
        context = SparkContext(conf=conf)
    spark = SparkSession(sparkContext=context)\
        .builder\
        .appName("Assignment1B Application")\
        .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connect…
        .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/fit5202_db.w…
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/fit5202_db.w…
        .getOrCreate()

    # Importing libraries to change the datatype
    from datetime import datetime
    from pyspark.sql.types import DateType
    from pyspark.sql.functions import udf

Step 02: Create Dataframe

In [2]:
    # Creating the dataframe by reading CSV
    crimeDataStats = spark.read.csv('Crime Statistics SA_2010_present.csv', inferSche…
    # Approach 2
    # crimeDataFrame = spark.createDataFrame(crimeDataStats)
    # printing the number of records in the csv
    print("The number of records in the CSV file are:", crimeDataStats.count(), "rec…

The number of records in the CSV file are: 727408 records

Step 03: Write to Database

In [3]:
    # Writing the data to the Mongo database
    crimeDataStats.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrit…

Step 04: Read from Database

In [4]:
    # Loading the data from the Mongo database
    crimeDataStatsFrame = spark.read.format("com.mongodb.spark.sql.DefaultSource").lo…
    crimeDataStatsFrame = crimeDataStatsFrame.drop("_id")
    # Printing the database schema
    print("The schema for the Crime Data Statistics is as follows: \n")
    crimeDataStatsFrame.printSchema()

The schema for the Crime Data Statistics is as follows:

root
 |-- Offence Count: integer (nullable = true)
 |-- Offence Level 1 Description: string (nullable = true)
 |-- Offence Level 2 Description: string (nullable = true)
 |-- Offence Level 3 Description: string (nullable = true)
 |-- Postcode - Incident: string (nullable = true)
 |-- Reported Date: string (nullable = true)
 |-- Suburb - Incident: string (nullable = true)

Step 05: Calculate the statistics of numeric and string columns

In [5]:
    # Finding the statistics for the two columns
    crimeDataStatsFrame.describe(['Offence Count', 'Reported Date']).show()

+-------+------------------+-------------+
|summary|     Offence Count|Reported Date|
+-------+------------------+-------------+
|  count|            727407|       727407|
|   mean|1.1715174585892079|         null|
| stddev| 0.578705930378106|         null|
|    min|                 1|    1/01/2011|
|    max|                28|    9/12/2018|
+-------+------------------+-------------+

Answer: The Reported Date values fall within the date range of the data, i.e. the dates begin in 2010 and go up to 2019, so the values seem plausible. Note that the column is still a string at this point, so min and max are compared lexicographically rather than chronologically.

Step 06: Change the data type of a column

In [6]:
    # User defined function to change the Reported Date from String to DateTime
    def dateChange(string):
        if string is not None:
            return datetime.strptime(string, '%d/%m/%Y')
        return None

    changeDateColumn = udf(lambda x: dateChange(x), DateType())
    crimeDataStatsFrame = crimeDataStatsFrame.withColumn("Reported_Date",\
        changeDateColumn(crimeDataSt…

Step 07: Preliminary data analysis

a. How many level 2 offences are there? Display the list of level 2 offences.

In [7]:
    # After running the following lines, it was found that there were null records in…
    # removing those null records
    crimeDataStatsFrameFiltered = crimeDataStatsFrame.\
        filter(crimeDataStatsFrame["Offence Level 2 Descr…
    # counting the level 2 distinct offences
    level2Offences = crimeDataStatsFrameFiltered.select("Offence Level 2 Descriptio…
    # printing the offences
    print("The number of different types of offences reported is: ", level2Offences,…
    crimeDataStatsFrameFiltered.groupBy("Offence Level 2 Description").count().sort("…

The number of different types of offences reported is: 9 offences.

+---------------------------+------+
|Offence Level 2 Description| count|
+---------------------------+------+
|       THEFT AND RELATED...|280572|
|       PROPERTY DAMAGE A...|166161|
|       ACTS INTENDED TO ...|112961|
|       SERIOUS CRIMINAL ...|104952|
|       OTHER OFFENCES AG...| 23407|
|       FRAUD DECEPTION A...| 19661|
|       SEXUAL ASSAULT AN...| 13403|
|       ROBBERY AND RELAT...|  5848|
|       HOMICIDE AND RELA...|   443|
+---------------------------+------+

b. What is the number of offences against the person?

In [8]:
    # Finding the offences against the person
    offencePerson = crimeDataStatsFrame.\
        filter(crimeDataStatsFrame["Offence Level 1 Description"] ==\
               'OFFENCES AGAINST THE PERSON')\
        .count()
    # printing the total number of offences against a person
    print("Offences against the person are:", offencePerson, "offences.")

Offences against the person are: 156062 offences.

c. How many serious criminal trespasses have an offence count of more than 1?

In [9]:
    # Finding the serious criminal trespass data with more than 1 count
    criminalTresspass = crimeDataStatsFrame.\
        filter((crimeDataStatsFrame["Offence Level 2 Description"] ==\
                'SERIOUS CRIMINAL TRESPASS') &\
               (crimeDataStatsFrame["Offence Count"] > 1)).coun…
    # printing the total number of trespasses with more than one offence count
    print("There were", criminalTresspass, "tresspasses with more than one offence co…

There were 8579 tresspasses with more than one offence count.

d. What percentage of crimes are offences against the property?

In [10]:
    # Finding the offence against property crimes
    # (groupBy() with no keys followed by sum() totals the numeric column, Offence Count)
    offenceAgainstProp = crimeDataStatsFrame.\
        filter(crimeDataStatsFrame["Offence Level 1 Description"] ==\
               'OFFENCES AGAINST PROPERTY')\
        .groupBy()\
        .sum()\
        .collect()[0][0]
    # Finding the total number of offences
    totalOffences = crimeDataStatsFrame.groupBy()\
        .sum()\
        .collect()[0][0]
    # Calculating the percentage of offence against property
    percentageOffence = round((offenceAgainstProp/totalOffences) * 100, 2)
    print("A total of", percentageOffence, "% of crimes is \"Offence against Property…

A total of 79.39 % of crimes is "Offence against Property"

Step 08: Exploratory data analysis

In [77]:
    # for plotting
    # !pip install matplotlib
    # !pip install numpy
    import matplotlib.pyplot as plt
    import numpy as np
    from pyspark.sql.functions import year, month, dayofweek
    %matplotlib inline
    %pylab inline

Populating the interactive namespace from numpy and matplotlib

a. Find the number of crimes per year.

In [47]:
    # For finding count of crimes per year
    crimeDataYear = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'])\
        .groupby(year('Reported_Date'))\
        .count()\
        .na.drop()\
        .sort('year(Reported_Date)')\
        .collect()
    # Fetching the labels for the graph ticks
    yearLabels = [row['year(Reported_Date)'] for row in crimeDataYear]
    crimeCountLabels = [row['count'] for row in crimeDataYear]
    index = np.arange(len(yearLabels))
    # Setting the label and title formatting
    plt.xlabel("Year", size=14)
    plt.ylabel("Count of Crimes", size=14)
    plt.title("Crimes Per Year", size=18)
    # plotting the number of crimes per year
    plt.plot(index, crimeCountLabels)
    # setting the tick values for X axis
    plt.xticks(index, yearLabels)
    plt.show()

[Figure: line plot "Crimes Per Year"; x-axis: Year (2010 to 2019); y-axis: Count of Crimes]

b. Find the number of crimes per month.

In [48]:
    # For finding count of crimes per month
    crimeDataMonth = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'])\
        .groupby(month('Reported_Date'))\
        .count()\
        .na.drop()\
        .sort('month(Reported_Date)')\
        .collect()
    # Fetching the labels for the graph ticks
    monthLabels = [row['month(Reported_Date)'] for row in crimeDataMonth]
    crimeCountLabels = [row['count'] for row in crimeDataMonth]
    index = np.arange(len(monthLabels))
    # Setting the label and title formatting
    plt.xlabel("Month", size=14)
    plt.ylabel("Count of Crimes", siz…
    plt.title("Crimes Per Month", size…
    index = np.arange(12)
    # plotting the number of crimes per month
    plt.plot(index, crimeCountLabels)
    # setting the tick values for X axis
    plt.xticks(np.arange(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',…
    plt.show()

[Figure: line plot "Crimes Per Month"; x-axis: Month (Jan to Dec); y-axis: Count of Crimes]

c. Where do most crimes take place? Find the top 20 suburbs (which would also display the postcode, e.g. Caulfield-3162).

In [97]:
    # Finding the top 20 suburbs where crime occurs the most
    # (with an ascending sort, the slice [-21:-1] leaves out the single largest row)
    crimeDataSuburb = crimeDataStatsFrame.select('Suburb - Incident', 'Postcode - Inci…
        .groupby('Suburb - Incident', 'Postcode - Incident')\
        .count()\
        .na.drop()\
        .sort("count")\
        .collect()[-21:-1]
    # Fetching the labels for the bar ticks
    suburbLabels = [str(row['Suburb - Incident']) + '-' + str(row['Postcode - Incident'…
    crimeCountLabels = [row['count'] for row in crimeDataSuburb]
    index = np.arange(len(suburbLabels))
    bar_width = 0.30
    pylab.rcParams['figure.figsize'] = (15, 9)  # setting the figure size
    fig, ax = plt.subplots()
    # Setting the label and title formatting
    ax.set_title("Number of crimes per suburb", size=14)
    ax.set_xlabel("Suburb", size=14)
    ax.set_ylabel("Number of Crimes", size=16)
    # Plotting the bar
    ax.bar(index + bar_width, crimeCountLabels, label="Agile Processes Book")
    # Plotting the tick values for X axis
    ax.set_xticks(index + bar_width)
    ax.set_xticklabels(suburbLabels, rotation=90, ha='center')
    ax.tick_params(axis='both', which='minor', labelsize=13)

[Figure: bar chart "Number of crimes per suburb"; x-axis: Suburb; y-axis: Number of Crimes]
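
The top-20 cell above sorts by count in ascending order and then takes the slice `[-21:-1]` of the collected rows. As a small plain-Python illustration of what that slice selects (the list `counts` is invented for the example and is not taken from the data set), the sketch below compares it with `[-20:]`:

```python
# Hypothetical ascending counts standing in for the sorted, collected rows.
counts = list(range(1, 31))  # 1, 2, ..., 30

# The slice used in the notebook cell: 20 items, but the largest is left out.
as_written = counts[-21:-1]

# A slice that keeps the 20 largest items, including the maximum.
top_twenty = counts[-20:]

print(len(as_written), as_written[0], as_written[-1])  # 20 10 29
print(len(top_twenty), top_twenty[0], top_twenty[-1])  # 20 11 30
```

Whether leaving out the largest row was intended cannot be told from the visible code; if it was not, `[-20:]` on the ascending sort, or a descending sort with the first 20 rows, would give the 20 busiest suburbs.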

d. Find the number of serious criminal trespasses by day and month.

In [112]:
    # importing seaborn
    # !pip install seaborn
    import pandas as pd
    import seaborn as sns
    # For finding count of crimes per month and day
    crimeDataMonandDay = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'…
        .groupby(month('Reported_Date'), dayofweek('Reported_Date'))\
        .count()\
        .na.drop()\
        .sort('month(Reported_Date)', 'dayofweek(Reported_Date)')\
        .collect()
    monthLabel = [item[0] for item in crimeDataMonandDay]
    daysLabel = [item[1] for item in crimeDataMonandDay]
    countLabel = [item[2] for item in crimeDataMonandDay]
    # Setting the dataframe to be used by facetgrid
    tresspassByMonth = {"count": countLabel, "month": monthLabel, "dayofweek": daysLa…
    tresspassByMonth = pd.DataFrame(tresspassByMonth)
    tresspassByMonth = tresspassByMonth.sort_values(by="month", ascending=True)
    # plot the grid
    graph = sns.FacetGrid(tresspassByMonth, hue="dayofweek", size=12)
    graph.map(plt.plot, "month", "count", linewidth=3)
    graph.add_legend()
    # plotting the ticks for x axis
    plt.xticks(np.arange(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug…
    plt.tick_params(axis='both', which='minor', labelsize=13)
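
The visible part of the cell above shows no filter on the offence type, and it counts rows instead of summing Offence Count. As a minimal plain-Python sketch of the aggregation the question describes, assuming the `%d/%m/%Y` date format from Step 06: the function names `spark_dayofweek` and `trespass_by_month_and_day` and the sample rows are hypothetical, and the day numbering mimics Spark's `dayofweek` (1 = Sunday to 7 = Saturday).

```python
from collections import Counter
from datetime import datetime


def spark_dayofweek(d):
    """Map a date to Spark's dayofweek numbering: 1 = Sunday ... 7 = Saturday."""
    # Python's isoweekday() is 1 = Monday ... 7 = Sunday.
    return d.isoweekday() % 7 + 1


def trespass_by_month_and_day(rows):
    """Sum offence counts of serious criminal trespasses per (month, dayofweek)."""
    totals = Counter()
    for level2, reported, count in rows:
        # Keep only the offence type the question asks about; skip missing dates.
        if level2 != "SERIOUS CRIMINAL TRESPASS" or reported is None:
            continue
        d = datetime.strptime(reported, "%d/%m/%Y")
        totals[(d.month, spark_dayofweek(d))] += count
    return dict(totals)


# Hypothetical sample rows: (level 2 description, reported date, offence count).
sample_rows = [
    ("SERIOUS CRIMINAL TRESPASS", "1/07/2010", 2),  # a Thursday
    ("SERIOUS CRIMINAL TRESPASS", "8/07/2010", 1),  # a Thursday
    ("SOME OTHER OFFENCE", "1/07/2010", 5),         # filtered out
    ("SERIOUS CRIMINAL TRESPASS", "4/07/2010", 1),  # a Sunday
    ("SERIOUS CRIMINAL TRESPASS", None, 3),         # missing date, skipped
]

result = trespass_by_month_and_day(sample_rows)
print(result)  # {(7, 5): 3, (7, 1): 1}
```

In Spark the same idea would correspond to filtering on the Level 2 description before grouping and aggregating Offence Count with a sum; the sketch only illustrates the logic and is not the notebook's own code.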
