
EK125 Final Project - Boston Crime Activity 2015-2018

Michael Baran, Harrison Peairs, Jordan Marabello

This project is based on a data set of 319,073 initial data points describing crime occurrences in Boston. Our intention with this data set is
to review the occurrences and determine whether crime patterns are increasing, holding at a standstill, or decreasing. Using this, we can determine the areas on
which the Boston Police Department should focus in upcoming years in order to lower overall crime in Boston and help those who inhabit
Boston's communities.

% Load files from the Excel data sets
CrimeData = table2struct(readtable('crime.xlsx', 'PreserveVariableNames', true));
CrimeCodes = table2struct(readtable('offense_codes.xlsx', 'PreserveVariableNames', true));
% The first part of our scrubbing. We removed the Lat, Long, and Location
% fields as these weren't needed for us
CrimeData = rmfield(CrimeData, 'Lat');
CrimeData = rmfield(CrimeData, 'Long');
CrimeData = rmfield(CrimeData, 'Location');

This initial scrubbing reads the data sets in and removes three fields that we knew we wouldn't use.

% Removed incomplete data from data set
j = 1;
for i = 1:length(CrimeData)
    if ~isempty(CrimeData(i).DISTRICT) && ~isnan(CrimeData(i).REPORTING_AREA) && ~isempty(CrimeData(i).STREET)
        CrimeData2(j) = CrimeData(i);
        j = j + 1;
    end
end

We saw that our data only had missing values in the District, Reporting Area, and Street fields. We didn't count the NaNs in Shooting, because a NaN
in that field doesn't represent incomplete data, just an indication that no shooting was present. We checked these three fields, and if data was present in all
three, the row was copied into a new variable. Our traditional approach of simply deleting incomplete rows in place wasn't working for us, which forced
us to make CrimeData2, which contains only the rows with complete data.
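
For reference, this deletion could also be done in place with logical indexing over the whole struct array rather than copying complete rows one at a time. The sketch below is illustrative only (the "keep" mask is a hypothetical helper) and assumes, as in our loop above, that DISTRICT and STREET are text fields and REPORTING_AREA is numeric.

% Hypothetical alternative: remove incomplete rows in place with a logical mask
keep = ~cellfun(@isempty, {CrimeData.DISTRICT}) & ...
       ~cellfun(@isnan,   {CrimeData.REPORTING_AREA}) & ...
       ~cellfun(@isempty, {CrimeData.STREET});
CrimeData(~keep) = [];    % delete the incomplete rows directly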

a = 1;
b = 1;
c = 1;
d = 1;

% Loop through the initial data and extract data by year

for i = 1:length(CrimeData2)

    if CrimeData2(i).YEAR == 2018
        CrimeData18(d) = CrimeData2(i);
        d = d + 1;
    elseif CrimeData2(i).YEAR == 2017
        CrimeData17(c) = CrimeData2(i);
        c = c + 1;
    elseif CrimeData2(i).YEAR == 2016
        CrimeData16(b) = CrimeData2(i);
        b = b + 1;
    elseif CrimeData2(i).YEAR == 2015
        CrimeData15(a) = CrimeData2(i);
        a = a + 1;
    end
end

Here, we broke up our data into four smaller variables, one for each year.
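
As a point of comparison, the same split can be written more compactly with logical indexing on the YEAR field; this sketch (the "years" vector is a hypothetical helper) would produce the same four per-year variables.

% Sketch: split by year with logical indexing instead of per-year counters
years = [CrimeData2.YEAR];                 % YEAR value of every row as a vector
CrimeData15 = CrimeData2(years == 2015);
CrimeData16 = CrimeData2(years == 2016);
CrimeData17 = CrimeData2(years == 2017);
CrimeData18 = CrimeData2(years == 2018);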

% Create initial bar graph to display crimes reported per year

BarYearX = [2015 2016 2017 2018];
BarDataY = [length(CrimeData15) length(CrimeData16) length(CrimeData17) length(CrimeData18)];
b = bar(BarYearX, BarDataY);
xtips1 = b(1).XEndPoints;
ytips1 = b(1).YEndPoints;
labels1 = string(b(1).YData);
text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
    'VerticalAlignment', 'bottom')
title('Crimes By Year')
xlabel('Year')
ylabel('Crimes')

This bar graph gives us a basic understanding of how many crimes were collected in the data set per year. This will allow us to make a more detailed
evaluation of the data set and reach a more definitive conclusion. In 2015, there were 51,450 crimes reported in the data set. In 2016, there were 92,046.
In 2017, there were 93,419. In 2018, there were 60,920.

% Creating a list of major crimes

OFFCODELIST = {'Larceny', 'Motor Vehicle', 'Missing', 'Property', 'Prostitution', 'Rape', ...
    'Robbery', 'Missing', 'Drugs', 'Investigate', 'Auto Theft', 'Residential', 'Commercial', ...
    'Fraud', 'Firearm', 'Prisoner', 'Liquor', 'Harassment'};

These are the 18 offense groups that we think are committed most often.
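
One way this guess could be checked against the data is to tally how often each OFFENSE_CODE_GROUP actually appears. The sketch below is illustrative only (the "groups", "counts", "names", and "order" variables are hypothetical helpers) and was not part of our final analysis.

% Sketch: list the ten most common offense groups in the scrubbed data
groups = categorical({CrimeData2.OFFENSE_CODE_GROUP});
[counts, names] = histcounts(groups);       % count per offense group
[counts, order] = sort(counts, 'descend');  % sort groups by frequency
names(order(1:10))                          % ten most common offense groups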

% Count crime quantities per year

count = 0;
crimevalues15 = [];
for j = 1:18
    for i = 1:length(CrimeData15)
        if sum(CrimeData15(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELIST{j}(1:4)) == 4
            count = count + 1;
        end
    end
    crimevalues15 = [crimevalues15 count];
    count = 0;
end

count = 0;
crimevalues16 = [];
for j = 1:18
    for i = 1:length(CrimeData16)
        if sum(CrimeData16(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELIST{j}(1:4)) == 4
            count = count + 1;
        end
    end
    crimevalues16 = [crimevalues16 count];
    count = 0;
end

count = 0;
crimevalues17 = [];
for j = 1:18
    for i = 1:length(CrimeData17)
        if sum(CrimeData17(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELIST{j}(1:4)) == 4
            count = count + 1;
        end
    end
    crimevalues17 = [crimevalues17 count];
    count = 0;
end

count = 0;
crimevalues18 = [];
for j = 1:18
    for i = 1:length(CrimeData18)
        if sum(CrimeData18(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELIST{j}(1:4)) == 4
            count = count + 1;
        end
    end
    crimevalues18 = [crimevalues18 count];
    count = 0;
end

With these four sets of loops, we iterated through our four per-year structs and counted up the number of crimes of each type. We did this
by using a cell array that holds the general offense codes, along with new arrays, crimevalues15 through crimevalues18, that hold the number of crimes of each type.
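
The same counts could also be produced without the inner loop by letting strncmp compare the first four characters of every offense group at once. This sketch is illustrative only (the "groups15" and "crimevalues15alt" variables are hypothetical helpers) and assumes the OFFCODELIST cell array defined above.

% Sketch: count the 2015 crimes per offense group with strncmp
groups15 = {CrimeData15.OFFENSE_CODE_GROUP};     % all 2015 offense groups
crimevalues15alt = zeros(1, 18);
for j = 1:18
    crimevalues15alt(j) = sum(strncmp(groups15, OFFCODELIST{j}, 4));
end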

% Merge all data into one matrix
CrimeType = [crimevalues15/length(CrimeData15); crimevalues16/length(CrimeData16); ...
    crimevalues17/length(CrimeData17); crimevalues18/length(CrimeData18)]*100;
CrimeType = round(CrimeType, 2);

% Display crimes as percentages for each year

close all
for i = 1:4
    subplot(2,2,i);
    b = bar(BarYearX, CrimeType(:,i));
    title(OFFCODELIST{i});
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(b(1).YData);
    text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
        'VerticalAlignment', 'bottom')
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end

close all
jj = 5;
for i = 1:4
    subplot(2,2,i);
    b = bar(BarYearX, CrimeType(:,jj));
    title(OFFCODELIST{jj});
    jj = jj + 1;
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(b(1).YData);
    text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
        'VerticalAlignment', 'bottom')
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end

close all
jj = 9;
for i = 1:4
    subplot(2,2,i);
    b = bar(BarYearX, CrimeType(:,jj));
    title(OFFCODELIST{jj});
    jj = jj + 1;
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(b(1).YData);
    text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
        'VerticalAlignment', 'bottom')
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end

close all
jj = 13;
for i = 1:4
    subplot(2,2,i);
    b = bar(BarYearX, CrimeType(:,jj));
    title(OFFCODELIST{jj});
    jj = jj + 1;
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(b(1).YData);
    text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
        'VerticalAlignment', 'bottom')
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end

close all
jj = 17;
for i = 1:2
    subplot(2,2,i);
    b = bar(BarYearX, CrimeType(:,jj));
    title(OFFCODELIST{jj});
    jj = jj + 1;
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(b(1).YData);
    text(xtips1, ytips1, labels1, 'HorizontalAlignment', 'center', ...
        'VerticalAlignment', 'bottom')
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end

These are the plots for the 18 crimes that we selected as the most frequently occurring. They show the percentage of total crime in Boston that each of
these crimes made up in each year. To keep all of the plots consistent, we used the same y-axis of 0 to 100.
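
For reference, the five copy-pasted plotting blocks above could be collapsed into a single loop over all 18 offense groups, starting a new 2x2 figure every four plots. This is only a sketch of that restructuring; it omits the bar-value text labels for brevity.

% Sketch: one loop that produces the same 2x2 pages for all 18 crime types
for jj = 1:18
    if mod(jj, 4) == 1
        figure;                             % start a new 2x2 page every 4 plots
    end
    subplot(2, 2, mod(jj-1, 4) + 1);
    bar(BarYearX, CrimeType(:, jj));
    title(OFFCODELIST{jj});
    ylim([0 100])
    xlabel('Year')
    ylabel('Crime Percentage')
end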

% Pie chart of top 5 crimes

close all
CrimePie = [(sum(CrimeType(:,1))/4) ((sum(CrimeType(:,2))/4)) ((sum(CrimeType(:,4))/4)) ((sum(CrimeType(:,9))/4)) ...
    ((sum(CrimeType(:,10))/4))];
CrimePie = [CrimePie (100-sum(CrimePie))];
labels = {'Larceny', 'Motor Vehicle', 'Property', 'Drugs', 'Investigation', 'Other'};
pie(CrimePie)
legend(labels)

This pie chart shows the top five most common crimes out of the eighteen we picked. 'Other' is the share of crimes that were not one of these five.

% Collecting amounts of major crimes from all years

OFFCODELISTCOMP = [OFFCODELIST(1) OFFCODELIST(2) OFFCODELIST(4) OFFCODELIST(9) OFFCODELIST(10)];

monthdata = zeros(6,48);
ccc = 1;
% Predictor row: month index 1 through 48
for i = 1:48
    monthdata(1,i) = ccc;
    ccc = ccc + 1;
end
for j = 1:5
    for i = 1:length(CrimeData15)
        if sum(CrimeData15(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELISTCOMP{j}(1:4)) == 4
            for k = 1:12
                if CrimeData15(i).MONTH == k
                    monthdata(j+1,k) = monthdata(j+1,k) + 1;
                end
            end
        end
    end
end

for j = 1:5
    for i = 1:length(CrimeData16)
        if sum(CrimeData16(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELISTCOMP{j}(1:4)) == 4
            for k = 13:24
                if CrimeData16(i).MONTH == k - 12
                    monthdata(j+1,k) = monthdata(j+1,k) + 1;
                end
            end
        end
    end
end

for j = 1:5
    for i = 1:length(CrimeData17)
        if sum(CrimeData17(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELISTCOMP{j}(1:4)) == 4
            for k = 25:36
                if CrimeData17(i).MONTH == k - 24
                    monthdata(j+1,k) = monthdata(j+1,k) + 1;
                end
            end
        end
    end
end

for j = 1:5
    for i = 1:length(CrimeData18)
        if sum(CrimeData18(i).OFFENSE_CODE_GROUP(1:4) == OFFCODELISTCOMP{j}(1:4)) == 4
            for k = 37:48
                if CrimeData18(i).MONTH == k - 36
                    monthdata(j+1,k) = monthdata(j+1,k) + 1;
                end
            end
        end
    end
end

In this code, we created a new variable with only the 5 crimes committed most often. We then used this data to plot and assess the correlation between crime counts
and month using linear regression in MATLAB's Regression Learner app.

% Re-scrubbed data
monthdata = monthdata(:,6:45);

Because no data from January-May of 2015 or October-December of 2018 was collected, we needed to remove those columns so that they would not
affect our linear regression results.
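
For completeness, the same per-month trends could also be fit in a script rather than interactively. The sketch below is illustrative only: the "months" and "mdl" variables are hypothetical helpers, fitlm stands in for the Regression Learner app, and the r-squared values quoted in the following paragraphs came from the app, not from this code.

% Sketch: fit a simple linear trend per crime type with fitlm
months = monthdata(1, :)';                    % predictor: month index 6 through 45
for j = 1:5
    mdl = fitlm(months, monthdata(j+1, :)');  % linear fit for crime type j
    fprintf('%s: R^2 = %.2f\n', OFFCODELISTCOMP{j}, mdl.Rsquared.Ordinary);
end
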
These plots show the data for crimes involving investigation by month. As we can see from the linear regression output, the r-squared
value is 0.37, which means that 37% of the variation in the data is accounted for by the linear regression equation. Taking the square root of the r-squared
value gives an r value of 0.608, indicating a somewhat strong positive linear correlation.

These plots show the data for drug-related crimes by month. The r-squared value is 0.22, which means that 22% of the variation in the data is
accounted for by the linear regression equation, giving an r value of 0.469. This indicates a somewhat weak positive linear correlation.

These plots show the data for property crimes by month. The reported r-squared value is -0.16, which means the chosen model does not follow
the trend of the data and fits worse than a horizontal line. Since a flat line represents the data better than the fitted trend, property-related crime
has remained essentially constant over these four years.

These plots show the data for motor vehicle crimes by month. The reported r-squared value is again -0.16, which means the chosen model fits worse
than a horizontal line. Since a flat line represents the data better than the fitted trend, motor vehicle crime has remained essentially constant over these
four years.

These plots show the data for larceny crimes by month. The r-squared value is 0.03, which means the fitted trend is barely better than a horizontal
line. Since a flat line is effectively the representation of the data, larceny has remained almost constant over these four years.

Conclusion:

Based on the data provided in the data sets, it is clear that the crimes occurring most often between 2015 and 2018 were related to
larceny, motor vehicle theft, drugs, investigation, and property. Within these, crimes related to investigation and drugs increased
over this timeframe, while motor vehicle theft, larceny, and property crimes remained at a standstill, neither increasing nor decreasing. To
counter these increases and standstills, we believe the Boston police should put more emphasis on these crimes in an attempt
to lower them.

Possible sources of bias: https://towardsdatascience.com/5-types-of-bias-how-to-eliminate-them-in-your-machine-learning-project-75959af9d3a0

This article goes through five of the most common types of bias seen in data sets in general, as well as in machine learning programs. The
article starts with some basic facts about why and how bias can even exist in programs, which are just lines of logical and rational code. As
explained in the article, most bias in programs comes from unconscious creator prejudice and from ready-made algorithms. The first common type of
bias is sample bias, where the collected data does not fully represent the environment the program will run in. The given example
is security cameras meant for daytime use that have only been given nighttime data, and the way to protect against this bias is to have as diverse a data set
as possible. The second common type is exclusion bias, where data scrubbing removes important data because the creator, through their own bias, believes it is
irrelevant. This is very common for programmers at any level, including ourselves, since we have to make data-scrubbing decisions based
solely on our own judgment. One way to protect against this bias is to analyze each feature individually before deciding what is important and
what isn't. The third common type is observer bias. One way to think of this bias is with our own data set on crime in Boston: since we all live in
Boston for college, we likely have certain thoughts and prejudices about where crime is most prevalent, and these prejudices can cause us to
look for trends that are not fully supported by the data. The only real solution is to be aware of those prejudices and not act on them
when programming. The fourth common bias is prejudice bias, where a program reproduces the same stereotyping that exists in real life because of the prejudiced data it is fed. The
solution is to make the data sets, as well as the algorithms, more balanced. The last common bias described in the article is measurement bias, where
something is wrong with the data collection device, creating a skewed data set through mechanical rather than human error. The easy solution is to test
the devices beforehand and make sure they are accurate.

Link to data set: https://www.kaggle.com/AnalyzeBoston/crimes-in-boston?select=crime.csv
