# Setting The Working Directory: Root - Dir
rm(list = ls())
library(outliers)
set.seed(42)
tail(crimeData)
boxplot(crimeData$Crime)
From the initial visualizations, it seems that there are two outliers.
crimeRate <- crimeData$Crime
crimeRate
##  [1]  791 1635  578 1969 1234  682  963 1555  856  705 1674  849  511  664  798
## [16]  946  539  929  750 1225  742  439 1216  968  523 1993  342 1216 1043  696
## [31]  373  754 1072  923  653 1272  831  566  826 1151  880  542  823 1030  455
## [46]  508  849
grubbs.test(crimeData$Crime, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: crimeData$Crime
## G = 2.81287, U = 0.82426, p-value = 0.07887
## alternative hypothesis: highest value 1993 is an outlier
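As a cross-check on the output above, the G statistic and its one-sided p-value can be recomputed with base R. This is a minimal sketch, assuming the standard Grubbs formulas: G is the distance of the highest value from the mean in standard deviations, and the p-value comes from the t distribution with n - 2 degrees of freedom. The variable names are illustrative, and the vector is copied from the printed `crimeRate` values.

```r
# Crime values copied from the printed crimeRate vector (47 observations)
crime <- c(791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
           511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
           523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
           831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849)

n <- length(crime)

# Grubbs statistic for the highest value: distance from the mean in SD units
G <- (max(crime) - mean(crime)) / sd(crime)

# One-sided p-value via the t distribution (assumed standard Grubbs formula)
t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
p_value <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

round(c(G = G, p_value = p_value), 5)
```

If these formulas match what `grubbs.test` uses internally, the result should agree with the reported G = 2.81287 and p-value = 0.07887.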
boxplot(crimeRate2)
Even though the p-value (0.07887) was not below .05, the point was visually identified as an outlier, so I decided to remove it to see how it would affect the data visually.
After removing this outlier (row 26), the next outlier will be identified and then removed to see how it affects the data.
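The code that creates `crimeRate2` does not appear in the output. One way the removal might be done, as a sketch rather than the original code, is to find the position of the largest value with `which.max` and drop it with negative indexing. The name `outlierRow` is a hypothetical helper.

```r
# Crime values copied from the printed crimeRate vector (47 observations)
crimeRate <- c(791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, 705, 1674, 849,
               511, 664, 798, 946, 539, 929, 750, 1225, 742, 439, 1216, 968,
               523, 1993, 342, 1216, 1043, 696, 373, 754, 1072, 923, 653, 1272,
               831, 566, 826, 1151, 880, 542, 823, 1030, 455, 508, 849)

# Position of the highest value (the 1993 flagged by the first Grubbs test)
outlierRow <- which.max(crimeRate)

# Drop that single observation; negative indexing removes by position
crimeRate2 <- crimeRate[-outlierRow]

length(crimeRate2)  # number of observations left
max(crimeRate2)     # new highest value, the next candidate outlier
```

Applying the same pattern to `crimeRate2` would give a `crimeRate3` with the next highest value removed.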
# Looking for outliers again
grubbs.test(crimeRate2, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: crimeRate2
## G = 3.06343, U = 0.78682, p-value = 0.02848
## alternative hypothesis: highest value 1969 is an outlier
boxplot(crimeRate3)
With the new data, the p-value for the second point (0.02848) is under .05. Removing the first point undoubtedly lowered this number, but the point is still statistically significant within this data set.
Looking at the data visually, the two initially identified outliers have been removed. A third point came close to being flagged in the initial boxplot, so I will look at it now to test its statistical significance as an outlier.
grubbs.test(crimeRate3, type = 10, opposite = FALSE, two.sided = FALSE)
##
## Grubbs test for one outlier
##
## data: crimeRate3
## G = 2.56457, U = 0.84712, p-value = 0.1781
## alternative hypothesis: highest value 1674 is an outlier
Even within the new data set, the p-value (0.1781) is more than three times .05, so we fail to reject the null hypothesis that the point is not an outlier. It also does not appear visually to be enough of an outlier to be removed.
Reflection
Looking at the data from a purely numerical perspective, I think removing the two data points helps create a data set that would produce more accurate models for the majority of situations. I do have some concerns about removing data in this situation. Each data point represents the data from an entire state. From what I know about the criminal justice system, states have a decent amount of autonomy over their prison systems and can tailor their prisons and punishments at the state level. While removing outliers might be good when looking at this data at the federal level, removing a data point means losing the data of an entire state. That information could be useful in showing the effectiveness or ineffectiveness of certain policies within that state, which could then provide valuable insight into policy changes at a national level.