
Gabi Gallo
Mr. Kiker
AP Statistics 7th Period
9/29/13
Linear Regression Project

I chose violent crime in a city vs. population of that city as my data because it's something that affects everyone. This data pertains to my life because what I find could affect where I choose to go to college or to live when I'm older, based on the crime and population of that city. For college I would want to choose a city with a large population and as little violent crime as possible, and this data and the linear regression could help me narrow down my choices for cities.

Violent Crime in a City vs. Population

[Figure: scatterplot of Number of Violent Crimes (y-axis) against Population (People) (x-axis), with a linear trendline, $y = 0.0061x + 613.88$, $R^2 = 0.9492$.]

The scatterplot of my data shows that the association between violent crimes in a city and the population of that city is linear. The explanatory variable is the population of a city. The more people that live in a city, the more violent crimes it has, making violent crime the response variable. The regression equation is $\hat{y} = 0.0061x + 613.88$, where $x$ is the population and $\hat{y}$ is the predicted number of violent crimes. The slope of the equation in this context means that for every additional person that lives in a city, the predicted number of violent crimes increases by .0061. The y-intercept means that when there are 0 people in a city, there should be 613.88 violent crimes. However, this isn't a reasonable assumption because when there are no people in a city it's not possible for there to be any crimes committed.

The data for New York City, (8289415, 52993), is an influential point. Without the data for New York City, the slope changes from .0061 to .0049 and the y-intercept changes even more, from 613.88 to 1780.4. Such a drastic change in the equation qualifies New York City as an influential point.

To find any outliers in the data, I used the equations $\bar{x} + 2s$ and $\bar{y} + 2s$, where $\bar{x}$ and $\bar{y}$ are the means and $s$ is the standard deviation. The standard deviation for this data set is 2,064,119.08, the x mean is 1,539,577, and the y mean is 10,001. There appear to be two potential outliers for this data, and those are (8289415, 52993) and (3855122, 18547). For a point to be an outlier, its x and y values would have to be greater than 5,667,815 and 4,138,239, respectively. For the first data point, the x value is an outlier but the y value is not, and for the second data point, neither the x nor the y value is an outlier. This means that neither point is an outlier, because for a point to be an outlier within the data, both the x and y values would need to be outliers.

The r value for the linear regression equation for the data is .97. Since the r value is close to 1, I know that the data has a strong, positive, linear association. The $r^2$ value is .949, or 94.9%. This means that 94.9% of the variation in the number of violent crimes in a city can be explained by the regression line on population. Other factors influencing the response variable could be the average income of the city, possibly the demographics, or even the literacy levels or graduation rate. Any of these things could contribute to the 5.1% of the variation that isn't explained by the population.

When I created a residual plot for my data, the points appeared very scattered and were in no pattern, confirming that the linear regression line was an appropriate model for my data. A linear model fits the data well, because the r value is close to 1 and the residual plot showed no pattern.
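The arithmetic behind the outlier cutoffs and the r value can be checked with a short script. This is only a sketch: it uses the numbers stated above, applies the one stated standard deviation to both cutoffs as the text does, and the variable names and the `flags` helper are made up for illustration.

```python
import math

# Values as stated in the text (not recomputed from the raw data).
x_mean = 1_539_577      # mean population
y_mean = 10_001         # mean number of violent crimes
s = 2_064_119.08        # standard deviation stated in the text

# Upper cutoffs: mean + 2 standard deviations.
# Assumption: the same s is used for both cutoffs, as in the text.
x_cutoff = x_mean + 2 * s
y_cutoff = y_mean + 2 * s

# The two potential outliers named in the text.
candidates = [(8_289_415, 52_993), (3_855_122, 18_547)]

def flags(point):
    """Return (x is beyond its cutoff, y is beyond its cutoff)."""
    x, y = point
    return (x > x_cutoff, y > y_cutoff)

for point in candidates:
    print(point, flags(point))

# r is the square root of r^2, positive here because the slope is positive.
r = math.sqrt(0.9492)
print(round(r, 2))
```

The first point is flagged on x only and the second on neither, which matches the conclusion above that neither point counts as an outlier under the both-values rule.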

If I use the linear regression equation I generated from the data to predict what the number of violent crimes would be in a city with a population of 840,660 people, I can substitute that number in, making the equation $\hat{y} = 0.0061(840{,}660) + 613.88$. The predicted number of violent crimes for a city of 840,660 people would be 5741.91 according to the linear regression line. The actual number of crimes for a city of that size is 5,189. I can calculate the residual, or the amount of error between the actual value and the predicted value, by using the equation $y - \hat{y}$, where $y$ is the actual value and $\hat{y}$ is the predicted value. By substituting in those values the equation is $5189 - 5741.91$, which equals -552.91. Because the residual is negative, the linear regression equation I calculated overestimates the actual value.

A career that would use this type of analysis would be law enforcement. They could look at and evaluate the crime rates between the cities and figure out how many officers are needed in each city, based on the predicted number of violent crimes in a city from its population. Law enforcement could use this information to put more officers in larger cities with higher crime rates, to prevent more crimes from happening and to meet the needs of the communities.

From my data and the linear regression equation generated from it, I can conclude that the higher the population of a city, the more violent crimes are committed in that city. Although the linear regression equation produced an overestimation, it still gives a general idea of how many violent crimes there are predicted to be. It is not possible to say that population causes the number of violent crimes, because correlation does not imply causation. Population can be a factor in the change of the response variable, but there are many other things that could influence the number of violent crimes as well, such as the average income of the city, possibly the demographics, literacy levels, graduation rate, or any number of things. However, the association suggests that population could have an impact on the number of violent crimes in a city.
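The prediction and residual worked out earlier can be reproduced in a few lines. This is a minimal sketch that assumes the rounded slope and intercept from the regression equation; the function name `predict` is made up for illustration.

```python
def predict(population, slope=0.0061, intercept=613.88):
    """Predicted number of violent crimes from the fitted line."""
    return slope * population + intercept

population = 840_660   # city size used in the example
actual = 5_189         # actual number of violent crimes for that city

predicted = predict(population)
residual = actual - predicted   # residual = actual - predicted

print(round(predicted, 2))  # 5741.91
print(round(residual, 2))   # -552.91
```

A negative residual means the line predicts more crimes than were actually recorded, which is the overestimation described above.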

Works Cited: "Violent Crime." FBI. FBI, 06 Aug. 2012. Web. 21 Sept. 2013.