
Data Science for Social Good: Crime Study

Mohammad Osama
Bahcesehir University, Data Analytics Department, 34000 Besiktas, Istanbul, Turkey, mohammad.osama@bahcesehir.edu.tr

Abstract


In this project we will use publicly accessible information to interpret crime trends within the city of San Francisco. The SF Police Department publishes a dataset of police incidents that covers incidents from January 1, 2018 to the present and is updated daily. The copy of the dataset used in this project is hosted on Kaggle and is likewise updated daily. When evaluated responsibly, such data can provide evidence that accurately explains difficult social problems, and it challenges ingenious individuals to make influential use of the raw input.

Keywords: ggplot, dplyr, and other R packages

1. INTRODUCTION

Within one's culture, quantitative research can have a huge effect on facilitating change. In the context of contemporary social problems, it is important to consider the value of technical research skills. Data science expertise will significantly improve one's usefulness as an individual or potential employee in the corporate workplace. However, it is also possible to use these same skills to positively influence society and contribute to its development.

1.1. Objective

We will explore, analyze, and draw inferences from a San Francisco crime data set in order to clarify the correlation between civilian-reported crime incidents and police-reported crime incidents.

1.2. Scope

The idea of open science has led to new open data requirements, and an exciting abundance of raw information is ready for hypotheses to be evaluated. The number of data science organizations focused on social good is rising rapidly, and voluntary community work is taking on a modern interpretation and methodology. Many local and federal governments encourage accessibility to insightful datasets; the significance of open information becomes more widespread as this trend progresses.

2. METHODOLOGY

In an effort to subset our data, we will use table intersection methods, aggregation methods to measure significant statistics, and simple visualizations to explain crime patterns along the way.

IMPLEMENTATION

First, in an attempt to comprehend all that we have, we must look over the available information. Let us take a look at the details and see whether the two datasets have any variables that are the same or similar. Then we can query these variables for careful study and return a standard figure such as a frequency count.

# Load required packages
install.packages("tidyverse")
install.packages("lubridate")
install.packages("readr")
install.packages("ggmap")
library(tidyverse)
library(lubridate)
library(ggmap)

# Read in the incidents dataset
incidents <- read_csv("police.department.incidents.csv")
# Glimpse the structure of the incidents dataset
glimpse(incidents)

# Read in the calls dataset
calls <- read_csv("police.department.calls.for.service.csv")
# Glimpse the structure of the calls dataset
glimpse(calls)

Now that we have a clearer understanding of the variables in our data collection, we can see that there are common variables that will enable us to address a greater number of questions. We may investigate the association between crimes reported by civilians and crimes reported by the police using the dates on which the incidents were registered.
# Aggregate the number of reported incidents by Date
daily_incidents <- incidents %>%
  count(Date, sort = TRUE) %>%
  rename(n_incidents = n)
# Glimpse the structure of the aggregated daily_incidents dataset
glimpse(daily_incidents)

# Glimpse the structure of the calls dataset
glimpse(calls)
# Aggregate the number of calls for police service by Date
daily_calls <- calls %>%
  count(`Call Date`, sort = TRUE) %>%
  rename(Date = `Call Date`, n_calls = n)
# Glimpse the structure of the aggregated daily_calls dataset
glimpse(daily_calls)

We are going to conduct a mutating join between the data frames to consolidate these details. The new data set retains only the days on which incidents were both reported by civilians and recorded by the police.

# Join data frames to create a new "mutated" set of information
shared_dates <- daily_calls %>%
  inner_join(daily_incidents, by = "Date")
# Take a glimpse of this new data frame
shared_dates

We now have a data frame created by merging the two datasets. To grasp this new information we must visualize it, since a table of raw numbers constrains our understanding of the crime statistics. So, let us try to concisely describe the results, which will lead to further questions. To better determine whether there is a relationship between these factors, we can look at the number of calls and incidents over time.

In order to perform new activities such as plotting, restructuring of the data is often required. Plotting is suited to "long format" data rather than "wide format" data. To plot time series results, we would like one column to represent the dates, one column to represent the counts, and one column to map each observation to one of the two report types. This lets the x, y, and color aesthetics each refer to a single column. The key column names the incident and call variables, and the value column holds their corresponding counts. The outcome is a long, narrow data frame with a row per date per report type, achieved by keeping the Date column out of the key.

# Gather into long format using the "Date" column to define observations
plot_shared_dates <- shared_dates %>%
  gather(key = report, value = count, -Date)
plot_shared_dates
# Plot points and regression trend lines
ggplot(plot_shared_dates, aes(x = Date, y = count, color = report)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Calculating a correlation coefficient between data vectors is a more quantitative way to characterize the relationship between two variables. The statistic lies on a spectrum from -1 to +1 and measures the linear dependency between two data sets: -1 indicates complete negative correlation, 0 no correlation whatsoever, and +1 perfect positive correlation. To illustrate how our statistical choices affect the conclusions we reach, we will look at two separate but connected correlation coefficients.

# Calculate correlation coefficient between daily frequencies
daily_cor <- cor(shared_dates$n_calls, shared_dates$n_incidents)
daily_cor
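
For intuition about the endpoints of that spectrum, here is a minimal toy example (unrelated to the project data):

# Toy vectors illustrating the correlation spectrum (not project data)
x <- c(1, 2, 3, 4, 5)
cor(x, 2 * x + 1)         # exactly +1: perfect positive linear relationship
cor(x, -3 * x)            # exactly -1: perfect negative linear relationship
cor(x, c(2, 5, 1, 4, 3))  # near 0: no clear linear relationship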
We can first check whether there is a day-to-day connection between the number of incidents and calls, but daily counts may be too noisy. By summarizing the data into monthly counts and estimating a correlation coefficient on those, we can take a clearer view of the underlying patterns.

# Summarize frequencies by month
correlation_df <- shared_dates %>%
  mutate(month = month(Date)) %>%
  group_by(month) %>%
  summarise(n_incidents = sum(n_incidents),
            n_calls = sum(n_calls))
# Calculate correlation coefficient between monthly frequencies
monthly_cor <- cor(correlation_df$n_incidents, correlation_df$n_calls)
monthly_cor

There are cases in which it is useful to subset data based on another set of values when dealing with relational datasets. Filtering joins are an alternative form of join that let us retain only the matching rows of a data frame while keeping that data frame's structure intact. It would be useful to recover the full record of every incident recorded by the police and every civilian call on their shared dates, so that we can measure identical figures from each dataset and compare the outcomes. In this case, we can use shared_dates to subset the complete calls and incidents data frames.


# Subset calls to police by shared_dates
calls_shared_dates <- semi_join(calls, shared_dates, by = c("Call Date" = "Date"))
glimpse(calls_shared_dates)
# Perform a sanity check that we are using this filtering join appropriately
sort(unique(shared_dates$Date)) == sort(unique(calls_shared_dates$`Call Date`))

# Filter recorded incidents by shared_dates
incidents_shared_dates <- filter(incidents, Date %in% shared_dates$Date)
glimpse(incidents_shared_dates)
# Perform a sanity check that we are using this filtering join appropriately
sort(unique(shared_dates$Date)) == sort(unique(incidents_shared_dates$Date))

Now, after subsetting the data sets, we need to see what the data looks like. We previously generated a scatterplot and applied a linear model to see how the number of calls and the frequency of recorded incidents have trended over time. Scatterplots are a good way to look at general patterns in continuous data. However, to see patterns in categorical results and compare their magnitudes, we need to visualize the ranked order of the variables.

# Create a bar chart of the number of calls for each crime
plot_calls_freq <- calls_shared_dates %>%
  count(`Original Crime Type Name`) %>%
  top_n(10, n) %>%
  ggplot(aes(y = reorder(`Original Crime Type Name`, n), x = n)) +
  geom_bar(stat = "identity") +
  xlab("Count") +
  ylab("Crime Description") +
  ggtitle("Calls Reported Crimes")
# Output the plot
plot_calls_freq

# Create a bar chart of the number of reported incidents for each crime
plot_incidents_freq <- incidents_shared_dates %>%
  count(Descript) %>%
  top_n(10, n) %>%
  ggplot(aes(x = n, y = reorder(Descript, n))) +
  geom_bar(stat = "identity") +
  ylab("Crime Description") +
  xlab("Count") +
  ggtitle("Incidents Reported Crimes")
# Output the plot
plot_incidents_freq

"GRAND THEFT FROM LOCKED AUTO" is far and away the most frequent incident. This category potentially captures opportunistic crimes in which unattended cars are broken into. However, the two datasets only describe the same events if the location of a called-in crime matches the location of the recorded incident. Let's check whether the locations of the most common citizen-reported crime and the most common police-recorded crime are close.

# Aggregate the number of reported incidents by crime type
incidents_types <- incidents %>%
  count(Descript, sort = TRUE) %>%
  rename(CrimeType = "Descript", n_incidents = n)
# Aggregated incidents dataset
incidents_types

glimpse(calls)
# Aggregate the number of calls for police service by crime type
calls_crime_types <- calls %>%
  count(`Original Crime Type Name`, sort = TRUE) %>%
  rename(CrimeType = `Original Crime Type Name`, n_calls = n)
# Aggregated calls dataset
calls_crime_types

# Filter calls to the crime types that also appear in incidents (a filtering join)
shared_crime_types <- calls_crime_types %>%
  semi_join(incidents_types, by = "CrimeType")
# Take a glimpse of this new data frame
shared_crime_types

# Arrange the top 10 locations of called-in crimes in a new variable
location_calls <- calls_shared_dates %>%
  filter(`Original Crime Type Name` == "Auto Boost / Strip") %>%
  count(Address) %>%
  arrange(desc(n)) %>%
  top_n(10, n)
location_calls

# Arrange the top 10 locations of reported incidents in a new variable
location_incidents <- incidents_shared_dates %>%
  filter(Descript == "GRAND THEFT FROM LOCKED AUTO") %>%
  count(Address) %>%
  arrange(desc(n)) %>%
  top_n(10, n)
location_incidents

# Print the top locations of each dataset for comparison
location_calls
location_incidents


# Filter grand theft auto incidents
auto_incidents <- incidents_shared_dates %>%
  filter(Descript == "GRAND THEFT FROM LOCKED AUTO")
glimpse(auto_incidents)

3. CONCLUSION

It seems that the datasets share the areas where auto crimes occur and are most commonly reported. It would be nice to map the co-occurrence of these locations to visualize the overlap, but only the reported police incidents carry longitude and latitude data. A map would give us immediate insight into where auto crimes take place and, most significantly, would provide an effective medium of communication.
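
As a rough sketch of what such a map could look like, the plot below places the incidents by their coordinates. It assumes the incidents data frame stores longitude and latitude in columns named X and Y; the column names may differ in other versions of the dataset, and a ggmap basemap could be layered underneath given an API key.

# Sketch: plot auto incidents by coordinates
# (assumes longitude and latitude live in columns X and Y;
#  check the column names in your copy of the dataset)
ggplot(auto_incidents, aes(x = X, y = Y)) +
  geom_point(alpha = 0.2, color = "red") +
  xlab("Longitude") +
  ylab("Latitude") +
  ggtitle("Locations of auto theft incidents")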
It becomes apparent that certain aspects of each dataset are not standardized as we ask deeper questions (such as the address variable in each data frame and the absence of exact position data in the calls), and thus require more advanced research.
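
For example, a first step toward reconciling the address fields might be a simple string normalization before comparing them; the sketch below is a minimal illustration under that assumption, not a complete standardization (the normalize_address helper is hypothetical, not part of the datasets).

# Sketch: normalize address strings before comparing across datasets
# (a minimal, hypothetical illustration; real reconciliation needs more work)
normalize_address <- function(a) {
  a %>%
    str_to_upper() %>%                # unify letter case
    str_replace_all("\\s+", " ") %>%  # collapse repeated whitespace
    str_trim()                        # drop leading/trailing spaces
}
location_calls %>% mutate(Address = normalize_address(Address))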

4. REFERENCES

City of San Francisco (2018, September). SF Police Calls for Service and Incidents, Version 141. Retrieved 16 December 2020 from https://www.kaggle.com/san-francisco/sf-police-calls-for-service-and-incidents. The dataset is maintained using Socrata's API and Kaggle's API and is distributed under the Open Data Commons Public Domain Dedication and License.
