FIT5202 Data Processing for Big Data
Assignment 1 - Part B
Student Name : Pooja Vishal Pancholi
Student ID : 29984939
Tutorial Day and Time : Thursday 6 to 8 PM
Tutor Name: Huashun Li
Step 01: Import pyspark and initialize Spark
# for mongodb
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-con...
# importing pyspark API libraries
from pyspark import SparkContext, SparkConf # Spark
from pyspark.sql import SparkSession # Spark SQL
context = SparkContext.getOrCreate()
if (context is None):
    conf = SparkConf().setAppName("Assignment1B Application").setMaster("local[*]")
    context = SparkContext(conf=conf)
spark = SparkSession(sparkContext=context)\
    .builder\
    .appName("Assignment1B Application")\
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connect...
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/fit5202_db.w...
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/fit5202_db.w...
    .getOrCreate()
# Importing libraries to change the datatype
from datetime import datetime
from pyspark.sql.types import DateType
from pyspark.sql.functions import udf
Step 02: Create Dataframe
# Creating the dataframe by reading CSV
crimeDataStats = spark.read.csv('Crime Statistics SA_2010_present.csv', inferSche...
# Approach 2
# crimeDataFrame = spark.createDataFrame(crimeDataStats)
# printing the number of records in the csv
print("The number of records in the CSV file are:",
      crimeDataStats.count(), "records")
The number of records in the CSV file are: 727408 records
Step 03: Write to Database
# Writing the data to the Mongo Database
crimeDataStats.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite")...
Step 04: Read from Database
# Loading the data from the Mongo database
crimeDataStatsFrame = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
crimeDataStatsFrame = crimeDataStatsFrame.drop("_id")
# Printing the database Schema
print("The schema for the Crime Data Statistics is as follows: \n")
crimeDataStatsFrame.printSchema()
The schema for the Crime Data Statistics is as follows:
root
 |-- Offence Count: integer (nullable = true)
 |-- Offence Level 1 Description: string (nullable = true)
 |-- Offence Level 2 Description: string (nullable = true)
 |-- Offence Level 3 Description: string (nullable = true)
 |-- Postcode - Incident: string (nullable = true)
 |-- Reported Date: string (nullable = true)
 |-- Suburb - Incident: string (nullable = true)
Step 05: Calculate the statistics of numeric and string columns
# Finding the summary statistics for the two columns
crimeDataStatsFrame.describe(['Offence Count', 'Reported Date']).show()
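+-------+------------------+-------------+
|summary|     Offence Count|Reported Date|
+-------+------------------+-------------+
|  count|            727407|       727407|
|   mean|1.1715174585892079|         null|
| stddev| 0.578705930378106|         null|
|    min|                 1|    1/01/2011|
|    max|                28|    9/12/2018|
+-------+------------------+-------------+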
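Answer: The min and max values of Reported Date lie within the date range of the data, i.e. the
dates begin in 2010 and run up to 2019, so the values are plausible. Because the column is still a
string at this step, however, describe() compares the values lexicographically, so they are not
necessarily the earliest and latest dates.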
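The schema in Step 04 lists Reported Date as a string, so the min and max shown by describe() are most likely the result of string comparison. The sketch below uses plain Python and a few made-up sample dates (assumptions, not values from the dataset) to illustrate how the lexicographic and chronological orderings can differ:

```python
from datetime import datetime

# Hypothetical sample values in d/m/Y form; not taken from the dataset.
sample_dates = ["1/07/2010", "9/12/2018", "31/03/2019", "1/01/2011"]

# Lexicographic comparison, as applied to a string column.
string_min = min(sample_dates)
string_max = max(sample_dates)

# Chronological comparison after parsing the strings into dates.
parsed = [datetime.strptime(d, "%d/%m/%Y").date() for d in sample_dates]
date_min = min(parsed)
date_max = max(parsed)

print("string min/max:", string_min, string_max)
print("date min/max:  ", date_min, date_max)
```

With these sample values the string ordering and the date ordering pick different rows, which is why the column is usually converted to a date type before its range is interpreted.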
Step 06: Change the data type of a column
# User Defined function to change the Reported Date from String to DateTime
def dateChange(string):
    if string is not None:
        return datetime.strptime(string, '%d/%m/%Y')
    return None
changeDateColumn = udf(lambda x: dateChange(x), DateType())
crimeDataStatsFrame = crimeDataStatsFrame.withColumn("Reported_Date",\
                          changeDateColumn(crimeDataSt...
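The parsing logic inside the UDF can be tried out without Spark. The following is a minimal sketch in plain Python; the name parse_reported_date and the handling of malformed strings are assumptions added for illustration and are not part of the notebook:

```python
from datetime import date, datetime
from typing import Optional


def parse_reported_date(value: Optional[str]) -> Optional[date]:
    """Parse a d/m/Y string into a date; return None for missing or malformed input."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, "%d/%m/%Y").date()
    except ValueError:
        # Assumption: malformed strings are treated like missing values.
        return None


print(parse_reported_date("1/01/2011"))
print(parse_reported_date(None))
print(parse_reported_date("not a date"))
```

A function of this shape could be wrapped with udf(..., DateType()) in the same way as dateChange.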
Step 07: Preliminary data analysis
a. How many level 2 offences are there? Display the list of level 2 offences.
# After running the following lines, it was found that there were null records in ...
# removing those null records
crimeDataStatsFrameFiltered = crimeDataStatsFrame.\
    filter(crimeDataStatsFrame["Offence Level 2 Description"]...
# counting the Level 2 distinct offences
level2Offences = crimeDataStatsFrameFiltered.select("Offence Level 2 Description")...
# printing the offences
print("The number of different types of offences reported is: ", level2Offences, "offences.")
crimeDataStatsFrameFiltered.groupBy("Offence Level 2 Description").count().sort(...
The number of different types of offences reported is:  9 offences.
+---------------------------+------+
|Offence Level 2 Description| count|
+---------------------------+------+
|       THEFT AND RELATED...|280572|
|       PROPERTY DAMAGE A...|166161|
|       ACTS INTENDED TO ...|112961|
|       SERIOUS CRIMINAL ...|104952|
|       OTHER OFFENCES AG...| 23407|
|       FRAUD DECEPTION A...| 19661|
|       SEXUAL ASSAULT AN...| 13403|
|       ROBBERY AND RELAT...|  5848|
|       HOMICIDE AND RELA...|   443|
+---------------------------+------+
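The same pattern (drop missing values, count the distinct categories, then list each category with its frequency in descending order) can be sketched in plain Python. The category labels below are short hypothetical placeholders, not the dataset's actual descriptions:

```python
from collections import Counter

# Hypothetical category labels, including one missing value (None).
descriptions = ["THEFT", "DAMAGE", "THEFT", None, "HOMICIDE", "THEFT"]

# Remove missing values before counting, mirroring the null filter above.
non_null = [d for d in descriptions if d is not None]

counts = Counter(non_null)
distinct = len(counts)           # number of distinct categories
ranked = counts.most_common()    # (category, count) pairs, largest count first

print("distinct categories:", distinct)
for category, count in ranked:
    print(category, count)
```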
b. What is the number of offences against the person?
# Finding the offences against the person
offencePerson = crimeDataStatsFrame.filter\
    (crimeDataStatsFrame["Offence Level 1 Description"] == 'OFFENCES AGAINST THE PERSON')\
    .count()
# printing the total number of offences against a person
print("Offences against the person are:", offencePerson, "offences.")
Offences against the person are: 156062 offences.
c. How many serious criminal trespasses with more than 1 offence count?
# Finding the serious criminal trespass data with more than 1 count
criminalTresspass = crimeDataStatsFrame.\
    filter((crimeDataStatsFrame["Offence Level 2 Description"] == \
            'SERIOUS CRIMINAL TRESPASS') &\
           (crimeDataStatsFrame["Offence Count"] > 1)).count()
# printing the total number of trespasses with more than one offence count
print("There were", criminalTresspass, "trespasses with more than one offence count.")
There were 8579 trespasses with more than one offence count.
d. What percentage of crimes are offences against the property?
# Finding the offence against property crimes
offenceAgainstProp = crimeDataStatsFrame.\
    filter(crimeDataStatsFrame["Offence Level 1 Description"] == \
           'OFFENCES AGAINST PROPERTY')\
    .groupBy()\
    .sum()\
    .collect()[0][0]
# Finding the total number of offences
totalOffences = crimeDataStatsFrame.groupBy()\
    .sum()\
    .collect()[0][0]
# Calculating the percentage of offence against property
percentageOffence = round((offenceAgainstProp/totalOffences) * 100, 2)
print("A total of", percentageOffence, "% of crimes is \"Offence against Property\"")
A total of 79.39 % of crimes is "Offence against Property"
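The percentage above is computed from summed Offence Count values rather than from row counts. The arithmetic can be sketched on its own as follows; the function name percentage_share and the input numbers are hypothetical and chosen only to illustrate the calculation:

```python
def percentage_share(part: float, total: float, digits: int = 2) -> float:
    """Return part as a percentage of total, rounded to the given number of digits."""
    if total == 0:
        raise ValueError("total must be non-zero")
    return round(part / total * 100, digits)


# Hypothetical offence-count sums; not the dataset's actual totals.
property_sum = 750
total_sum = 1000

print(percentage_share(property_sum, total_sum))
```

Guarding against a zero total is an added assumption; the notebook divides directly because the dataset is known to be non-empty.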
Step 08: Exploratory data analysis
# for plotting
# !pip install matplotlib
# !pip install numpy
import matplotlib.pyplot as plt
import numpy as np
from pyspark.sql.functions import year, month, dayofweek
%matplotlib inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib
a. Find the number of crimes per year.
# For finding count of crimes per year
crimeDataYear = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'])\
    .groupby(year('Reported_Date'))\
    .count()\
    .na.drop()\
    .sort('year(Reported_Date)')\
    .collect()
# Fetching the labels for the graph ticks
yearLabels = [row['year(Reported_Date)'] for row in crimeDataYear]
crimeCountLabels = [row['count'] for row in crimeDataYear]
index = np.arange(len(yearLabels))
# Setting the label and title formatting
plt.xlabel("Year", size=14)
plt.ylabel("Count of Crimes", size=14)
plt.title("Crimes Per Year", size=18)
# plotting the number of crimes per year
plt.plot(index, crimeCountLabels)
# setting the tick values for X axis
plt.xticks(index, yearLabels)
plt.show()
[Figure: line plot "Crimes Per Year"; x-axis: Year (2010 to 2019), y-axis: Count of Crimes]
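The group-by-year step produces two parallel lists, the year labels and the matching counts, which are then passed to the plot. A minimal plain-Python sketch of that aggregation, using made-up dates rather than the dataset, could look like this:

```python
from collections import Counter
from datetime import date

# Hypothetical reported dates; None stands for a missing value.
reported = [date(2010, 7, 1), date(2011, 1, 1), date(2011, 5, 9), None, date(2012, 3, 4)]

# Count rows per year, skipping missing dates (similar in spirit to na.drop()).
per_year = Counter(d.year for d in reported if d is not None)

# Sorted year labels and their counts, ready to be used as tick labels and y values.
years = sorted(per_year)
counts = [per_year[y] for y in years]

print(years)
print(counts)
```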
b. Find the number of crimes per month.
# For finding count of crimes per month
crimeDataMonth = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'])\
    .groupby(month('Reported_Date'))\
    .count()\
    .na.drop()\
    .sort('month(Reported_Date)')\
    .collect()
# Fetching the labels for the graph ticks
monthLabels = [row['month(Reported_Date)'] for row in crimeDataMonth]
crimeCountLabels = [row['count'] for row in crimeDataMonth]
index = np.arange(len(monthLabels))
# Setting the label and title formatting
plt.xlabel("Month", size=14)
plt.ylabel("Count of Crimes", size=...)
plt.title("Crimes Per Month", size=...)
index = np.arange(12)
# plotting the number of crimes per month
plt.plot(index, crimeCountLabels)
# setting the tick values for X axis
plt.xticks(np.arange(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
                           'Oct', 'Nov', 'Dec'])
plt.show()
[Figure: line plot "Crimes Per Month"; x-axis: Month (Jan to Dec), y-axis: Count of Crimes]
c. Where do most crimes take place? Find the top 20 suburbs (which would also display the
postcode, e.g. Caulfield-3162).
# Finding the top 20 suburbs where crimes occur the most
crimeDataSuburb = crimeDataStatsFrame.select('Suburb - Incident', 'Postcode - Incident')\
    .groupby('Suburb - Incident', 'Postcode - Incident')\
    .count()\
    .na.drop()\
    .sort("count")\
    .collect()[-21:-1]
# Fetching the labels for the bar ticks
suburbLabels = [str(row['Suburb - Incident']) + '-' + str(row['Postcode - Incident'])
                for row in crimeDataSuburb]
crimeCountLabels = [row['count'] for row in crimeDataSuburb]
index = np.arange(len(suburbLabels))
bar_width = 0.30
pylab.rcParams['figure.figsize'] = (15, 9) # setting the figure size
fig, ax = plt.subplots()
# Setting the label and title formatting
ax.set_title("Number of crimes per suburb", size=14)
ax.set_xlabel("Suburb", size=14)
ax.set_ylabel("Number of Crimes", size=16)
# Plotting the bar
ax.bar(index + bar_width, crimeCountLabels, label=
       'Agile Processes Book')
# Plotting the tick values for X axis
ax.set_xticks(index + bar_width)
ax.set_xticklabels(suburbLabels, rotation=90, ha='center')
ax.tick_params(axis='both', which='minor', labelsize=13)
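It may be worth checking whether the slice [-21:-1] is what is intended here. On a list sorted in ascending order it returns 20 rows but leaves out the final (largest) one, whereas [-20:] returns the 20 largest. Whether the last row should be excluded for this dataset is not stated in the source, so the sketch below only illustrates the slicing behaviour on made-up rows:

```python
# Hypothetical (suburb, count) pairs sorted in ascending order of count (1..30).
rows = [("S%02d" % i, i) for i in range(1, 31)]

window = rows[-21:-1]   # 20 rows, but the last (largest) row is excluded
top_20 = rows[-20:]     # the 20 rows with the largest counts

print(len(window), window[0][1], window[-1][1])
print(len(top_20), top_20[0][1], top_20[-1][1])
```

An alternative that avoids negative slicing is to sort in descending order and take the first 20 rows.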
[Figure: bar chart "Number of crimes per suburb"; x-axis: Suburb, y-axis: Number of Crimes]
d. Find the number of serious criminal trespasses by day and month.
# importing seaborn
# !pip install seaborn
import pandas as pd
import seaborn as sns
# For finding count of crimes per month and day
crimeDataMonandDay = crimeDataStatsFrame.select(['Offence Count', 'Reported_Date'...
    .groupby(month('Reported_Date'), dayofweek('Reported_Date'))\
    .count()\
    .na.drop()\
    .sort('month(Reported_Date)', 'dayofweek(Reported_Date)')\
    .collect()
monthLabel = [item[0] for item in crimeDataMonandDay]
daysLabel = [item[1] for item in crimeDataMonandDay]
countLabel = [item[2] for item in crimeDataMonandDay]
# Setting the dataframe to be used by facetgrid
tresspassByMonth = {"count": countLabel, "month": monthLabel, "dayofweek": daysLabel}
tresspassByMonth = pd.DataFrame(tresspassByMonth)
tresspassByMonth = tresspassByMonth.sort_values(by = "month", ascending = True)
# plot the grid
graph = sns.FacetGrid(tresspassByMonth, hue="dayofweek", size = 12)
graph.map(plt.plot, "month", "count", linewidth = 3)
graph.add_legend()
# plotting the ticks for x axis
plt.xticks(np.arange(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug',
                           'Sep', 'Oct', 'Nov', 'Dec'])
plt.tick_params(axis='both', which='minor', labelsize=13)
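When reading the legend of this plot, note that Spark's dayofweek numbers the days from 1 (Sunday) to 7 (Saturday). The plain-Python sketch below reproduces that numbering and the (month, day of week) grouping on a few made-up dates; the helper name spark_style_dayofweek is hypothetical:

```python
from collections import Counter
from datetime import date


def spark_style_dayofweek(d: date) -> int:
    """Map a date to 1 (Sunday) .. 7 (Saturday), the convention used by Spark's dayofweek."""
    # isoweekday() gives 1 (Monday) .. 7 (Sunday), so shift it by one day.
    return d.isoweekday() % 7 + 1


# Hypothetical dates: two Sundays and a Monday in January 2019, a Saturday in February 2019.
dates = [date(2019, 1, 6), date(2019, 1, 7), date(2019, 1, 13), date(2019, 2, 2)]

# Count rows per (month, day of week) pair.
by_month_and_day = Counter((d.month, spark_style_dayofweek(d)) for d in dates)

for (month_no, day_no), count in sorted(by_month_and_day.items()):
    print(month_no, day_no, count)
```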