
Cross-Sectional Dataset Regression Modeling

Andrew Flores, Jeremy Flores, Shannon Park, JunGyo Kim

10/20/2023

1. Briefly discuss the question you are trying to answer with your model.

The question we are trying to answer with the “MurderRates” dataset is: do the explanatory variables
(convictions, executions, time, income, lfp, noncauc, and southern) have a statistically significant effect on
the response variable, murder rate per 100,000?

The seven explanatory/independent variables are:

Convictions: Number of convictions divided by number of murders in 1950.
Executions: Average number of executions during 1946–1950 divided by convictions in 1950.
Time: Median time served (in months) of convicted murderers released in 1951.
Income: Median family income in 1949 (in 1,000 USD).
Lfp: Labor force participation rate in 1950 (in percent).
Noncauc: Proportion of the population that is non-Caucasian in 1950.
Southern: Factor indicating region (southern: yes/no).

2. Give a description of your dataset including:

(a)

Source: Maddala (2001), Table 8.4, p. 330.

Maddala, G.S. (2001). Introduction to Econometrics, 3rd ed. New York: John Wiley.

library(AER)
data("MurderRates")

(b)

The “MurderRates” dataset is cross-sectional data on U.S. states in 1950, stored as a data frame with 44
observations on 8 variables. The response variable is the murder rate per 100,000 people. The dataset seeks
to give insight into factors that may or may not be associated with the murder rate: the conviction rate, the
execution rate, median time served, median family income, labor force participation, the non-Caucasian
share of the population, and region.
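
As a quick illustrative check of the structure just described (this sketch is not part of the original analysis),
the data frame can be inspected directly:

str(MurderRates)   # 44 observations of 8 variables: 7 numeric plus the southern factor
head(MurderRates)  # first few observations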

(c)

Histograms of the Variables

par(mfrow=c(2,2))
hist(MurderRates$rate, main = "Histogram of Murder Rates",
xlab = "Murder Rates per 100,000", probability = TRUE)

lines(density(MurderRates$rate), col = "blue", lwd = 4)
hist(MurderRates$convictions, main = "Histogram of Convictions",
xlab = "Convictions in 1950", probability = TRUE)
lines(density(MurderRates$convictions), col = "blue", lwd= 4)
hist(MurderRates$executions, main = "Histogram of Executions",
xlab = "Executions during 1946-1950", probability = TRUE)
lines(density(MurderRates$executions), col = "blue", lwd = 4)
hist(MurderRates$time, main = "Histogram of Median Time of Incarceration",
xlab = "Median Time Served (in months) in 1950", probability = TRUE)
lines(density(MurderRates$time), col = "blue", lwd = 4)

[Figure: histograms with density curves for Murder Rates per 100,000, Convictions in 1950, Executions during 1946-1950, and Median Time Served (in months).]

hist(MurderRates$income, main = "Histogram of Median Family Income",
xlab = "Median Family Income in 1949 (in 1,000 USD)", probability = TRUE)
lines(density(MurderRates$income), col = "blue", lwd = 4)
hist(MurderRates$lfp, main = "Histogram of Labor Force Participation (LFP)",
xlab = "LFP (in percent) in 1950", probability = TRUE)
lines(density(MurderRates$lfp), col = "blue", lwd = 4)
hist(MurderRates$noncauc, main = "Histogram of the Proportion of
Population that is Non-Caucasian",
xlab = "Proportion of Population that is Non-Caucasian in 1950", probability = TRUE)
lines(density(MurderRates$noncauc), col = "blue", lwd = 4)
library(dplyr)

[Figure: histograms with density curves for Median Family Income in 1949 (in 1,000 USD), Labor Force Participation (in percent) in 1950, and the Proportion of the Population that is Non-Caucasian in 1950.]

count(MurderRates, southern)

## southern n
## 1 no 29
## 2 yes 15

Five-Number Summary and Boxplots of the Variables

summary(MurderRates$rate)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.810 1.808 3.625 5.404 7.725 19.250

summary(MurderRates$convictions)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.1080 0.1663 0.2260 0.2605 0.3202 0.7570

summary(MurderRates$executions)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.00000 0.02625 0.04500 0.06034 0.08225 0.40000

summary(MurderRates$time)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 34.0 94.0 124.0 136.5 179.0 298.0

summary(MurderRates$income)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.760 1.550 1.830 1.781 2.070 2.390

summary(MurderRates$lfp)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 47.00 51.50 53.40 53.07 54.52 58.80

summary(MurderRates$noncauc)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.00300 0.02175 0.06450 0.10559 0.14450 0.45400

summary(MurderRates$southern)

## no yes
## 29 15

par(mfrow=c(2,2))
boxplot(MurderRates$rate, main = "Boxplot of Murder Rates")
boxplot(MurderRates$convictions, main = "Boxplot of Convictions")
boxplot(MurderRates$executions, main = "Boxplot of Executions")
boxplot(MurderRates$time, main = "Boxplot of Median Time Served")

[Figure: boxplots of Murder Rates, Convictions, Executions, and Median Time Served.]

boxplot(MurderRates$income, main = "Boxplot of Median Family Income")


boxplot(MurderRates$lfp, main = "Boxplot of LFP Rate")
boxplot(MurderRates$noncauc, main = "Boxplot of Non-Caucasian")

[Figure: boxplots of Median Family Income, LFP Rate, and the Non-Caucasian proportion.]

Correlation Matrix of the Variables

library(corrplot)

## corrplot 0.92 loaded

library(ggplot2)

cornew <- MurderRates[, -8]


correlation <- cor(cornew)
correlation

## rate convictions executions time income


## rate 1.0000000 -0.251134671 0.17279303 -0.518584162 -0.65427652
## convictions -0.2511347 1.000000000 -0.21352746 0.004829982 0.06577502
## executions 0.1727930 -0.213527464 1.00000000 0.079556439 0.03783757
## time -0.5185842 0.004829982 0.07955644 1.000000000 0.30356188
## income -0.6542765 0.065775022 0.03783757 0.303561875 1.00000000
## lfp -0.1827364 -0.172351045 0.29625365 0.154922595 0.55790866
## noncauc 0.7486359 -0.184331213 0.22077998 -0.335964017 -0.66062442
## lfp noncauc
## rate -0.1827364 0.7486359
## convictions -0.1723510 -0.1843312

## executions 0.2962537 0.2207800
## time 0.1549226 -0.3359640
## income 0.5579087 -0.6606244
## lfp 1.0000000 -0.1507521
## noncauc -0.1507521 1.0000000

par(mfrow=c(1,1))
corrplot(correlation, method = "circle")

[Figure: correlation plot (method = "circle") of rate, convictions, executions, time, income, lfp, and noncauc.]

corrplot(correlation, method = "number")

[Figure: correlation plot (method = "number") showing the same correlation coefficients as the printed matrix above.]

Analysis: Based on the histograms and five-number summaries, all variables except labor force participation
and median family income are right-skewed, as their means exceed their medians; a transformation such as
a log could bring these distributions closer to normal. Median family income is left-skewed, since its mean
is smaller than its median. Labor force participation appears approximately normally distributed, with
similar mean and median values. The boxplots show that, except for median time served and the labor force
participation rate, the variables tend to have one to three outliers. The southern factor shows that
non-southern states (“no”) are more prevalent, making up 29/44 or about 65.9% of the observations, while
southern states (“yes”) account for the remaining 15/44 or about 34.1%.
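
As a hedged illustration of the transformation idea mentioned above (this sketch was not part of the
estimated models), the right-skewed variables could be log-transformed; because executions has a minimum
of zero, log1p() is used for that variable to avoid taking the log of zero:

par(mfrow = c(1, 2))
hist(MurderRates$rate, main = "Murder Rate", probability = TRUE)
hist(log(MurderRates$rate), main = "log(Murder Rate)", probability = TRUE)
# executions contains zeros, so log1p(x) = log(1 + x) avoids -Inf values
summary(log1p(MurderRates$executions))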

(d) Possible Violation of the Regression Assumptions

The outliers visible in the boxplots (1 in murder rate, 3 in convictions, 2 in executions, 2 in noncauc) and
the right-skewed histograms suggest non-normality and, potentially, heteroskedasticity. The functional form
may therefore be misspecified and may call for transformations such as logs or the addition of
squared/interaction terms. The correlation plot shows that no pairwise correlation falls below -0.8 or above
0.8, which indicates that the variables are not highly correlated with one another. Since high correlation
between explanatory variables could lead to unreliable models, this is a good sign.
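
The ±0.8 rule of thumb used above can also be checked programmatically; a small sketch of that check
(the cutoff itself is our own convention, not a formal test):

offdiag <- correlation[upper.tri(correlation)]   # each pairwise correlation counted once
any(abs(offdiag) > 0.8)   # expected FALSE: the largest |r| in the printed matrix is about 0.75
max(abs(offdiag))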

3. Estimate a multiple linear regression model that includes main effects only
(i.e. no interactions or higher order terms). This is our baseline model.

(a)

# Baseline model with main effects only
reg1 <- lm(rate ~ convictions + executions + time + income + lfp + noncauc + southern,
data = MurderRates)
summary(reg1)

##
## Call:
## lm(formula = rate ~ convictions + executions + time + income +
## lfp + noncauc + southern, data = MurderRates)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9913 -1.1943 -0.3538 1.2383 6.5574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44436 9.96694 0.045 0.9647
## convictions -4.33938 2.78313 -1.559 0.1277
## executions 2.85276 6.12313 0.466 0.6441
## time -0.01547 0.00705 -2.194 0.0348 *
## income -2.50013 1.68519 -1.484 0.1466
## lfp 0.19357 0.20614 0.939 0.3540
## noncauc 10.39903 5.40610 1.924 0.0623 .
## southernyes 3.26216 1.32980 2.453 0.0191 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.459 on 36 degrees of freedom
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.6965
## F-statistic: 15.1 on 7 and 36 DF, p-value: 5.105e-09

par(mfrow=c(2,2))
plot(reg1, main = "Residuals vs Fitted for Reg 1")

[Figure: diagnostic plots for reg1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

# 5% critical value for the overall F-test with 7 and 36 degrees of freedom
Fc1 <- qf(0.95, 7, 36)
Fc1

## [1] 2.277143

Statistical Significance

Looking at the regression output, the intercept and the variables convictions, executions, income, lfp, and
noncauc are not statistically significant at the 5% level, as they all have p-values greater than 0.05. Only
two variables, time and southern, are statistically significant at the 5% level, with p-values below 0.05. The
majority of the explanatory variables may lack statistical significance partly because of the small sample
size of 44; as the sample size increases, the results may change.
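
An equivalent way to read these individual significance results (an illustrative check, not required output)
is through 95% confidence intervals: intervals that contain zero correspond to coefficients that are not
significant at the 5% level.

# 95% confidence intervals for the baseline model's coefficients
confint(reg1, level = 0.95)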

Unrealistic Signs and/or Magnitudes

Since the p-values of time and southern are below 0.05, median time served and the southern regional factor
are statistically significant at the 5% level. Given the small sample, there may simply be too few
observations to detect significance for the remaining variables. The estimates for convictions, income, time,
noncauc, and southern have plausible signs. With a higher conviction rate, we expect fewer murders because
of the greater risk of punishment and imprisonment. Our reasoning for time is that the longer the time
served, the less likely previously convicted murderers are to commit murder again. Wealthier populations,
which in 1950 were disproportionately Caucasian, are expected to have lower murder rates. Lastly, southern
states are likely to have higher murder rates due to gun culture and higher poverty. The estimates for labor
force participation and executions are surprising, however, because their coefficients are positive. We would
expect that people who are employed have less time to commit murder, and that a higher number of
executions per conviction raises the cost of committing murder, implying an inverse relationship. These
anomalies and unexpected magnitudes may stem from the small sample size of 44. The intercept of 0.44436
could be realistic, though the true value for the US is likely smaller.

(b)

Looking at the regression output, the R-squared is 0.7459, which is considerably high: the explanatory
variables jointly explain approximately 74.59 percent of the variation in the murder rate. This gives us
confidence that the model is a good start, but more tests need to be performed to better assess its fit. The
plot of the model's residuals is a small concern, as a few outliers make the variance look inconsistent. This
may reflect the point raised in 2(d) that heteroskedasticity may be present, as is common with
cross-sectional data. However, most of the residuals look well behaved, and the model's p-value is
approximately zero, which gives us confidence in the model as constructed. In terms of the overall
significance of the model, since the F-statistic in the regression output (15.1) is greater than Fc1, we reject
the null hypothesis and conclude that the explanatory variables are jointly significant for the murder rate.
The model's p-value, which is below 0.05, leads to the same conclusion at the 5% significance level.
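
The overall-significance conclusion can also be verified directly from the F distribution; a short sketch
using the reported statistic and the critical value computed above:

# P-value for the overall F-test: F = 15.1 with 7 and 36 degrees of freedom
pf(15.1, df1 = 7, df2 = 36, lower.tail = FALSE)   # approximately 5e-09, consistent with the output
15.1 > Fc1                                        # TRUE: reject the null of no joint significance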

FEATURE SELECTION

4. Test the model in (3) for multicollinearity using VIF. Based on this test
remove the appropriate variables and estimate a new regression model based on
these findings. Be sure to justify your reason/criteria for removal.

library(broom)
library(knitr)
library(car)
vif1 <- vif(reg1)
vif1

## convictions executions time income lfp noncauc


## 1.106062 1.255795 1.341514 3.168240 1.858710 2.700726
## southern
## 2.891300

Based on the results above, all of the explanatory variables have VIFs below 5. This is a good sign that the
explanatory variables are not strongly correlated with, and do not have substantial explanatory power for,
one another. Thus, none of the variables need to be removed.
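
For intuition (an illustrative sketch only), each VIF equals 1/(1 - R^2) from an auxiliary regression of one
explanatory variable on all of the others; for example, the VIF for income reported above can be reproduced
by hand:

# Auxiliary regression: income on the remaining explanatory variables
aux <- lm(income ~ convictions + executions + time + lfp + noncauc + southern,
data = MurderRates)
1 / (1 - summary(aux)$r.squared)   # matches vif1["income"], about 3.17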

5. Using AIC or Schwartz Criterion, determine which subset of predictors you
will keep and generate a new model. Comment on the performance of this model
compared to the one in (3).

beststep <- step(reg1, direction ="backward")

## Start: AIC=86.35
## rate ~ convictions + executions + time + income + lfp + noncauc +

## southern
##
## Df Sum of Sq RSS AIC
## - executions 1 1.312 218.99 84.613
## - lfp 1 5.331 223.01 85.413
## <none> 217.68 86.348
## - income 1 13.309 230.99 86.960
## - convictions 1 14.699 232.38 87.224
## - noncauc 1 22.373 240.05 88.653
## - time 1 29.113 246.79 89.872
## - southern 1 36.388 254.07 91.150
##
## Step: AIC=84.61
## rate ~ convictions + time + income + lfp + noncauc + southern
##
## Df Sum of Sq RSS AIC
## - lfp 1 7.006 226.00 83.999
## <none> 218.99 84.613
## - income 1 12.666 231.66 85.087
## - convictions 1 16.056 235.05 85.726
## - noncauc 1 25.037 244.03 87.376
## - time 1 27.856 246.85 87.881
## - southern 1 38.946 257.94 89.815
##
## Step: AIC=84
## rate ~ convictions + time + income + noncauc + southern
##
## Df Sum of Sq RSS AIC
## - income 1 6.369 232.37 83.222
## <none> 226.00 83.999
## - convictions 1 21.255 247.25 85.954
## - time 1 28.030 254.03 87.143
## - southern 1 35.330 261.33 88.390
## - noncauc 1 39.586 265.58 89.101
##
## Step: AIC=83.22
## rate ~ convictions + time + noncauc + southern
##
## Df Sum of Sq RSS AIC
## <none> 232.37 83.222
## - convictions 1 20.420 252.79 84.928
## - time 1 26.694 259.06 86.006
## - noncauc 1 57.408 289.77 90.936
## - southern 1 60.181 292.55 91.355

summary(beststep)

##
## Call:
## lm(formula = rate ~ convictions + time + noncauc + southern,
## data = MurderRates)
##
## Residuals:
## Min 1Q Median 3Q Max

## -4.5200 -1.5375 -0.4578 1.4669 6.6747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.885154 1.434504 4.103 0.000201 ***
## convictions -4.974470 2.687056 -1.851 0.071714 .
## time -0.014578 0.006887 -2.117 0.040723 *
## noncauc 14.520544 4.677909 3.104 0.003546 **
## southernyes 3.729056 1.173343 3.178 0.002899 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.441 on 39 degrees of freedom
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7009
## F-statistic: 26.2 on 4 and 39 DF, p-value: 1.357e-10

Based on the backward selection above and the extraction of the lowest-AIC model (AIC = 83.22), we
conclude that the better model keeps the explanatory variables convictions, time, noncauc, and southern.
All of the variables in the new model except convictions are statistically significant at the 5% level, as
shown by their p-values being less than 0.05; convictions is statistically significant at the 10% level.
Compared to the baseline model, in which the majority of the parameter estimates were not statistically
significant, the new model has all of its parameter estimates significant at the 5% level except one. This
suggests that the new model performs better than the baseline model.
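
Because beststep is nested within reg1, a partial F-test offers a complementary check (a supplementary
sketch, not part of the assignment's required output) of whether dropping executions, income, and lfp
significantly worsens the fit:

# Partial F-test of the reduced model (beststep) against the baseline model (reg1)
anova(beststep, reg1)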

6. Using the model in (5), plot the residuals versus the fitted values, ŷ, and
comment on your results.

residualPlot(beststep, main = "Residual Plot for Beststep Model")

[Figure: Pearson residuals versus fitted values for the beststep model.]

Based on the residual plot, the model created in (5) appears to show some heteroskedasticity: the spread of
the residuals is not constant and tends to increase at higher fitted values.
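
The visual impression of unequal variance can be supplemented with a formal test; as a hedged sketch, a
Breusch-Pagan test (bptest() from the lmtest package, which loads with AER) could be applied to the same
model:

# Breusch-Pagan test: the null hypothesis is homoskedasticity
library(lmtest)
bptest(beststep)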

7. Perform a RESET test on the model in (5) and comment on the results.

resettest(beststep, power=2, type="fitted")

##
## RESET test
##
## data: beststep
## RESET = 1.8453, df1 = 1, df2 = 38, p-value = 0.1823

Based on the results of the RESET test, the p-values of 0.1823 and 0.2817 (the latter from an additional
specification not shown in this output) are both greater than 0.05. Thus, we fail to reject the null at the 5%
significance level. The squared fitted-value term is not statistically significant, so squared and interaction
terms do not improve the model and should not be included.
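
The additional specification referenced above does not appear in the knitted output; as a hedged guess, it
may have been one of the following RESET variants (squared and cubed powers of the fitted values, or
powers of the regressors):

# RESET with squared and cubed fitted values
resettest(beststep, power = 2:3, type = "fitted")
# RESET using powers of the regressors instead of the fitted values
resettest(beststep, power = 2, type = "regressor")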

8. Using the appropriate method learnt in class, test the model in (4) for
heteroskedasticity and comment on the conclusion. If it is present, correct the
model before moving on. Based on the results in (c) or (d), this might be helpful
in transforming the model in the event that its functional form presents an issue.

GQ test

gqtest(beststep, point=0.5, alternative="greater")

##
## Goldfeld-Quandt test
##
## data: beststep
## GQ = 1.0132, df1 = 17, df2 = 17, p-value = 0.4894
## alternative hypothesis: variance increases from segment 1 to 2

Based on the GQ test, since the p-value of 0.4894 is greater than 0.05, we fail to reject the null hypothesis of
homoskedasticity at the 5% significance level. There is insufficient evidence of heteroskedasticity, so no
correction is needed.
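
Had the GQ test instead indicated heteroskedasticity, one standard correction would be
heteroskedasticity-robust (White/HC) standard errors; a sketch of that correction using coeftest() with
vcovHC() from the sandwich package (also loaded with AER), shown only to document what we would have
done:

# Robust (HC1) standard errors for the reduced model
library(lmtest)
library(sandwich)
coeftest(beststep, vcov. = vcovHC(beststep, type = "HC1"))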

9. Using a combination of the results from the previous steps, estimate a model
based on your findings which includes interaction terms or higher power terms
(if necessary). You may need to use forward or backward selection for this.
Comment on the performance of this model compared to your other models.
Make sure to use AIC and Schwartz criterion for model comparison.

# AIC and BIC test
kable(glance(reg1), digits = 2)

r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC     BIC     deviance  df.residual  nobs
0.75       0.7            2.46   15.1       0        7   -97.61  213.22  229.27  217.68    36           44

kable(glance(beststep), digits=2)

r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC     BIC     deviance  df.residual  nobs
0.73       0.7            2.44   26.2       0        4   -99.04  210.09  220.79  232.37    39           44

Since we failed to reject the null hypothesis for the RESET test in Q7, we do not add squared or interaction
terms. Since we failed to reject the null hypothesis for the GQ test in Q8, we do not need to correct for
heteroskedasticity. Based on the AIC and Schwartz criterion comparison between our original regression
reg1 and the new regression beststep, both the AIC and the BIC decrease for the new regression. This tells
us that the new regression model is a better fit.
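
The same comparison can be made directly with the built-in extractors (an equivalent check of the kable
output above; note that step() reports AIC on a different additive scale than AIC()):

AIC(reg1, beststep)   # lower for beststep, matching the AIC column above
BIC(reg1, beststep)   # lower for beststep, matching the BIC column above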

10. Provide a short 1 paragraph summary of your overall conclusion, findings,
and limitations not previously stated above.

The question we initially set out to answer was: do the explanatory variables (convictions, executions, time,
income, lfp, noncauc, and southern) have a statistically significant effect on the murder rate per 100,000?
After refining the model with the step() function, we kept only the variables convictions, time, noncauc, and
southern. We did not add any interaction or squared terms, based on the RESET test results, and the VIF
results indicated that no correction for multicollinearity was needed. The adjusted R-squared increased and
the AIC and BIC both decreased, indicating that the new model fits the data better. However, there are
limitations. The sample size of 44 is small, which makes our inferences susceptible to sampling variation and
incorrect generalizations. Moreover, the histograms indicate non-normality for most of the variables, and we
were unable to apply a log transformation to some of them because zero values (for example, in executions)
produce undefined values in R. The non-normality and the handful of outliers in each variable could reduce
the accuracy of our inferences, and the residual plot still shows some indication of heteroskedasticity. These
limitations could potentially be addressed with a larger sample size.

