Project:
Data Science for Social Good: Crime Study
Mohammad Osama
Bahcesehir University, Data Analytics Department, 34000 Besiktas, Istanbul, Turkey, mohammad.osama@bahcesehir.edu.tr
the association between crimes reported by civilians and crimes reported by the police, by the dates on which the incidents were registered.

# Aggregate the number of reported incidents by Date
daily_incidents <- incidents %>%
  count(Date, sort = TRUE) %>%
  rename(n_incidents = n)

# Glimpse the structure of the aggregated daily_incidents dataset
glimpse(daily_incidents)

# Glimpse the structure of the calls dataset
glimpse(calls)

# Aggregate the number of calls for police service by Date
daily_calls <- calls %>%
  count(`Call Date`, sort = TRUE) %>%
  rename(Date = `Call Date`, n_calls = n)

# Glimpse the structure of the aggregated daily_calls dataset
glimpse(daily_calls)

We will perform a mutating join between the two data frames to consolidate these details. The resulting dataset retains only the days on which incidents were both reported by civilian calls and recorded by the police.

# Join data frames to create a new "mutated" set of information
shared_dates <- daily_calls %>%
  inner_join(daily_incidents, by = "Date")

# Take a glimpse of this new data frame
shared_dates

We now have a data frame created by merging the two datasets. As a table of raw numbers, however, it tells us little about crime patterns on its own. Let us therefore summarize the results concisely, which will in turn raise further questions. To judge whether there is a relationship between these variables, we can look at the number of calls and incidents over time.

Plotting often requires restructuring the data first. For plotting with ggplot2, "long format" data is more suitable than "wide format" data. To plot a time series we want one column for the dates, one column for the counts, and one column that labels each observation as either calls or incidents; the x, y, and color aesthetics can then each map to a single column. The key column identifies whether a row refers to incidents or calls, and the value column holds the corresponding count. By keeping the Date column out of the key, the result is a long, narrow data frame with one row per date per report type.

# Gather into long format, using the "Date" column to define observations
plot_shared_dates <- shared_dates %>%
  gather(key = report, value = count, -Date)
plot_shared_dates

# Plot points and regression trend lines
ggplot(plot_shared_dates, aes(x = Date, y = count, color = report)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Calculating a correlation coefficient between two vectors is a more quantitative way to characterize the relationship between two variables. The coefficient ranges from -1 to +1 and measures the strength of linear dependence between two data sets: -1 indicates perfect negative correlation, 0 no correlation at all, and +1 perfect positive correlation. To show how the level of aggregation affects the conclusions we draw, we will look at two separate but related correlation coefficients.

# Calculate the correlation coefficient between daily frequencies
daily_cor <- cor(shared_dates$n_calls, shared_dates$n_incidents)
daily_cor

We can first check whether there is a day-to-day connection between the number of incidents and calls, but daily counts may be too noisy. Summarizing the data into monthly counts and recomputing the correlation coefficient can give a clearer view of the underlying pattern.

# Summarize frequencies by month
correlation_df <- shared_dates %>%
  mutate(month = month(Date)) %>%
  group_by(month) %>%
  summarise(n_incidents = sum(n_incidents),
            n_calls = sum(n_calls))

# Calculate the correlation coefficient between monthly frequencies
monthly_cor <- cor(correlation_df$n_incidents, correlation_df$n_calls)
monthly_cor

When working with relational datasets, it is often useful to subset one data frame based on the values in another. Filtering joins are a form of join that keeps or drops rows of one data frame according to matches in a second, without adding any new columns. It would be useful to retrieve the full records of every incident recorded by police and every civilian call on their shared dates, so that we can compute the same figures from each dataset and compare the outcomes. In this case, we can use shared_dates.
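As a quick illustration of the correlation bounds discussed above, the boundary values of cor() can be reproduced with toy vectors (illustrative data, not the crime datasets):

```r
# Toy vectors illustrating the range of the correlation coefficient
x <- c(1, 2, 3, 4)

cor(x, 2 * x)            # perfect positive linear relation: +1
cor(x, -x)               # perfect negative linear relation: -1
cor(x, c(1, -1, -1, 1))  # no linear relation: 0
```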
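As a minimal sketch of how a filtering join differs from a mutating join (toy data frames, not the crime datasets; requires dplyr):

```r
library(dplyr)

left  <- tibble(id = c(1, 2, 3), value = c("a", "b", "c"))
right <- tibble(id = c(2, 3, 4), score = c(10, 20, 30))

# Mutating join: keeps only matching rows AND adds columns from `right`
inner_join(left, right, by = "id")  # columns: id, value, score

# Filtering join: keeps matching rows of `left`, adds no columns
semi_join(left, right, by = "id")   # columns: id, value
```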
Using shared_dates, we subset the complete calls and incidents data frames.

# Subset calls to police by shared_dates
calls_shared_dates <- semi_join(calls, shared_dates,
                                by = c("Call Date" = "Date"))
glimpse(calls_shared_dates)

# Sanity check that the filtering join behaved as expected
sort(unique(shared_dates$Date)) == sort(unique(calls_shared_dates$`Call Date`))

# Filter recorded incidents by shared_dates
incidents_shared_dates <- filter(incidents, Date %in% shared_dates$Date)
glimpse(incidents_shared_dates)

# Sanity check that the filtering behaved as expected
sort(unique(shared_dates$Date)) == sort(unique(incidents_shared_dates$Date))

Now that the data sets are subset, we need to see what the data looks like. We previously generated a scatterplot and fitted a linear model to see how the number of calls and the frequency of recorded incidents have trended over time. Scatterplots are a good way to examine general patterns in continuous data. To see patterns in categorical data, however, and to judge their relative importance, we need to view the variables in ranked order.

# Create a bar chart of the number of calls for each crime
plot_calls_freq <- calls_shared_dates %>%
  count(`Original Crime Type Name`) %>%
  top_n(10, n) %>%
  ggplot(aes(y = reorder(`Original Crime Type Name`, n), x = n)) +
  geom_bar(stat = "identity") +
  xlab("Count") +
  ylab("Crime Description") +
  ggtitle("Calls Reported Crimes")

# Output the plot
plot_calls_freq

# Create a bar chart of the number of reported incidents for each crime
plot_incidents_freq <- incidents_shared_dates %>%
  count(Descript) %>%
  top_n(10, n) %>%
  ggplot(aes(x = n, y = reorder(Descript, n))) +
  geom_bar(stat = "identity") +
  ylab("Crime Description") +
  xlab("Count") +
  ggtitle("Incidents Reported Crimes")

# Output the plot
plot_incidents_freq

"GRAND THEFT AUTO" is far and away the most frequent felony. This category potentially captures opportunistic crimes in which unattended cars are broken into. However, that interpretation holds only where the location of a called-in crime matches the location of the recorded incident. Let us check whether the locations of the most common civilian-reported crime and the most common police-recorded crime are close to each other.

# Aggregate the number of reported incidents by crime type
incidents_types <- incidents %>%
  count(Descript, sort = TRUE) %>%
  rename(CrimeType = Descript, n_incidents = n)

# Aggregated incidents dataset
incidents_types

glimpse(calls)

# Aggregate the number of calls for police service by crime type
calls_crime_types <- calls %>%
  count(`Original Crime Type Name`, sort = TRUE) %>%
  rename(CrimeType = `Original Crime Type Name`, n_calls = n)

# Aggregated calls dataset
calls_crime_types

# Filtering join: keep the call crime types that also appear among the incidents
shared_crime_types <- calls_crime_types %>%
  semi_join(incidents_types, by = "CrimeType")

# Take a glimpse of this new data frame
shared_crime_types

# Collect the top 10 locations of called-in crimes in a new variable
location_calls <- calls_shared_dates %>%
  filter(`Original Crime Type Name` == "Auto Boost / Strip") %>%
  count(Address) %>%
  arrange(desc(n)) %>%
  top_n(10, n)
location_calls

# Collect the top 10 locations of reported incidents in a new variable
location_incidents <- incidents_shared_dates %>%
  filter(Descript == "GRAND THEFT FROM LOCKED AUTO") %>%
  count(Address) %>%
  arrange(desc(n)) %>%
  top_n(10, n)
location_incidents

# Print the top locations of each dataset for comparison
location_calls
location_incidents
# Filter grand theft auto incidents
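Beyond eyeballing the two printouts, the overlap between top-location lists can also be checked programmatically with intersect(); a minimal sketch using made-up addresses (not the actual study data):

```r
# Toy example: overlap between two top-location lists (illustrative addresses)
location_calls_top     <- c("800 Block MARKET ST", "500 Block JONES ST", "100 Block 6TH ST")
location_incidents_top <- c("800 Block MARKET ST", "100 Block 6TH ST", "900 Block MISSION ST")

# Addresses that appear in both lists
intersect(location_calls_top, location_incidents_top)
```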
3. CONCLUSION