
HW3

# Setting the working directory


knitr::opts_knit$set(root.dir = '/Users/andrew/Documents/GT/ISYE-6501/Week 3/data')
getwd()

## [1] "/Users/andrew/Documents/GT/ISYE-6501/Week 3"

rm(list = ls())
library(outliers)
set.seed(42)

#make sure wd is set up correctly


crimeData <- read.table("uscrime.txt", stringsAsFactors = FALSE,
header = TRUE)
head(crimeData)

##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682

tail(crimeData)




















##       M So   Ed  Po1 Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 42 14.1  0 10.9  5.6 5.4 0.523  96.8   4  0.2 0.107 3.7   4890 17.0 0.088904
## 43 16.2  1  9.9  7.5 7.0 0.522  99.6  40 20.8 0.073 2.7   4960 22.4 0.054902
## 44 13.6  0 12.1  9.5 9.6 0.574 101.2  29  3.6 0.111 3.7   6220 16.2 0.028100
## 45 13.9  1  8.8  4.6 4.1 0.480  96.8  19  4.9 0.135 5.3   4570 24.9 0.056202
## 46 12.6  0 10.4 10.6 9.7 0.599  98.9  40  2.4 0.078 2.5   5930 17.1 0.046598
## 47 13.0  0 12.1  9.0 9.1 0.623 104.9   3  2.2 0.113 4.0   5880 16.0 0.052802
##       Time Crime
## 42 12.1996   542
## 43 31.9989   823
## 44 30.0001  1030
## 45 32.5996   455
## 46 16.6999   508
## 47 16.0997   849

#Plotting out the crime data to look for immediate outliers


plot(crimeData$Crime, type = "b")














boxplot(crimeData$Crime)

From the initial visualizations, it seems that there are two outliers.
crimeRate <- crimeData$Crime
crimeRate

##  [1]  791 1635  578 1969 1234  682  963 1555  856  705 1674  849  511  664  798
## [16]  946  539  929  750 1225  742  439 1216  968  523 1993  342 1216 1043  696
## [31]  373  754 1072  923  653 1272  831  566  826 1151  880  542  823 1030  455
## [46]  508  849
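The visual read of the boxplot can be cross-checked numerically. The sketch below is an addition, not part of the original analysis: it uses base R's `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` draws, to list the values plotted beyond the whiskers. The vector is copied from the printed `crimeRate` output, and `flagged` is a name introduced here for illustration.

```r
# Crime values copied from the printed crimeRate vector above
crimeRate <- c(
  791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
  511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
  523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
  831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849
)

# Values lying beyond the whiskers (default coef = 1.5)
flagged <- boxplot.stats(crimeRate)$out
print(flagged)

# Positions of those values in the original vector
print(which(crimeRate %in% flagged))
```

Listing the flagged values this way makes it easier to see which points the plot treats as extreme and which merely sit near the whisker.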

grubbs.test(crimeData$Crime, type = 10, opposite = FALSE, two.sided = FALSE)

##
## Grubbs test for one outlier
##
## data: crimeData$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier

#1993 is the most extreme value, but with p = 0.07887 it is not statistically significant at the .05 level
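To make the reported statistics less of a black box, here is a minimal base-R sketch of how `G` and `U` can be reproduced by hand. It assumes the usual definitions: `G` is the distance of the maximum from the mean in standard deviations, and `U` is the ratio of the sum of squared deviations without the suspect point to the sum with it. The variable names are illustrative and not taken from the `outliers` package.

```r
# Crime values copied from the printed crimeRate vector above
crimeRate <- c(
  791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
  511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
  523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
  831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849
)

# G: how many sample standard deviations the maximum lies above the mean
G <- (max(crimeRate) - mean(crimeRate)) / sd(crimeRate)

# U: sum of squares without the suspect point over the sum of squares with it
trimmed <- crimeRate[-which.max(crimeRate)]
U <- sum((trimmed - mean(trimmed))^2) / sum((crimeRate - mean(crimeRate))^2)

print(round(c(G = G, U = U), 5))
```

Both values should agree with the `grubbs.test()` output above up to rounding.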

#Removing that point


crimeRate2 <- crimeRate[-26]
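Hard-coded positions such as `-26` only work for this exact ordering of the data. As a hedged alternative, not the approach used in this report, `which.max()` can locate the highest value at each step; `idx1` and `idx2` are names introduced here for illustration.

```r
# Crime values copied from the printed crimeRate vector above
crimeRate <- c(
  791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
  511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
  523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
  831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849
)

# Position of the current maximum, then drop it
idx1 <- which.max(crimeRate)
crimeRate2 <- crimeRate[-idx1]

# Repeat on the reduced vector
idx2 <- which.max(crimeRate2)
crimeRate3 <- crimeRate2[-idx2]

print(c(first = idx1, second = idx2))
```

Note that `which.max()` returns only the first position if the maximum is tied, which is not an issue for this data.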

#Plotting out the crime data again to see the change


plot(crimeRate2, type = "b")

boxplot(crimeRate2)









Even though the p-value wasn’t below .05, the point was visually identified as an outlier, so I decided to remove it to see how that would affect the data visually.
After removing the one outlier at row 26, the next outlier will be identified and then removed to see how it affects the data.
#Looking for outliers again
grubbs.test(crimeRate2, type = 10, opposite = FALSE, two.sided =
FALSE)

##
## Grubbs test for one outlier
##
## data: crimeRate2
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier






#Removing that point


crimeRate3 <- crimeRate2[-4]

#Plotting out the crime data again to see the change


plot(crimeRate3, type = "b")

boxplot(crimeRate3)



The p-value of the second point is under .05 with the new data. Although the removal of the other point is undoubtedly decreasing this number, it is still statistically significant within this data set.
Looking at the data visually, the two initially identified outliers have been removed. There was a third point that was close in the initial boxplot that I will look at now to see its statistical significance as an outlier.
grubbs.test(crimeRate3, type = 10, opposite = FALSE, two.sided =
FALSE)

##
## Grubbs test for one outlier
##
## data: crimeRate3
## G = 2.56457, U = 0.84712, p-value = 0.1781
## alternative hypothesis: highest value 1674 is an outlier



Even within the new data set, the p-value (0.1781) is more than three times .05, so we fail to reject the null hypothesis that the highest value, 1674, is not an outlier. It also doesn’t appear visually to be enough of an outlier to be removed.
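The three tests can be summarized in one pass. The sketch below is an illustration in base R rather than a replacement for `grubbs.test()`: `grubbs_max()` is a hypothetical helper, and its p-value uses the common t-distribution approximation for the one-sided Grubbs test, which is assumed here to be what the package computes because it agrees with the p-values reported in this document.

```r
# Crime values copied from the printed crimeRate vector above
crimeRate <- c(
  791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
  511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
  523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
  831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849
)

# Hypothetical helper: one-sided Grubbs statistic and approximate p-value
# for the highest value of x (assumed t-distribution approximation)
grubbs_max <- function(x) {
  n <- length(x)
  g <- (max(x) - mean(x)) / sd(x)
  t_stat <- sqrt(n * (n - 2) * g^2 / ((n - 1)^2 - n * g^2))
  p <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
  c(n = n, max = max(x), G = g, p = p)
}

# Test the current maximum, remove it, and repeat three times
x <- crimeRate
steps <- list()
for (i in 1:3) {
  steps[[i]] <- grubbs_max(x)
  x <- x[-which.max(x)]
}
results <- as.data.frame(do.call(rbind, steps))
print(round(results, 5))
```

Reading the rows top to bottom shows how each p-value depends on which points were removed before it, which is why the order of removal matters when interpreting the second and third tests.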
Reflection

Taking a look at the data from a purely numerical perspective, I think the removal of the two data points helps to create a data set that would produce more accurate models for the majority of situations. I do have some concerns about the removal of data as it applies to this situation. Each data point represents the data from an entire state. From what I know about the criminal justice system, prison systems have a decent amount of autonomy within each state and can tailor their prisons and punishments at the state level. While the removal of outliers might be good when looking at this data federally, removing a data point means losing the data of an entire state. That information could be useful in showing the effectiveness or ineffectiveness of certain policies within that state, which could then provide valuable insight into policy changes at a national level.