
Cross-Sectional Dataset Regression Modeling

Andrew Flores, Jeremy Flores, Shannon Park, JunGyo Kim

10/20/2023

1. Briefly discuss the question you are trying to answer with your model.

The question we are trying to answer with the “MurderRates” dataset is: do the explanatory variables
(convictions, executions, time, income, lfp, noncauc, and southern) have a statistically significant effect on
the response variable, murder rate per 100,000?

The seven explanatory/independent variables are:

Convictions: Number of convictions divided by number of murders in 1950.
Executions: Average number of executions during 1946–1950 divided by convictions in 1950.
Time: Median time served (in months) of convicted murderers released in 1951.
Income: Median family income in 1949 (in 1,000 USD).
Lfp: Labor force participation rate in 1950 (in percent).
Noncauc: Proportion of the population that is non-Caucasian in 1950.
Southern: Factor indicating region (southern: yes/no).

2. Give a description of your dataset including:

(a)

Source: Maddala (2001), Table 8.4, p. 330.

Maddala, G.S. (2001). Introduction to Econometrics, 3rd ed. New York: John Wiley.

library(AER)
data("MurderRates")

(b)

The “MurderRates” dataset is cross-sectional data on U.S. states in 1950, stored as a data frame with 44
observations on 8 variables. The response variable is the murder rate per 100,000 people. The dataset seeks
to give insight into factors that may or may not be associated with the murder rate: the conviction rate, the
execution rate, median time served, median family income, labor force participation, the non-Caucasian
share of the population, and region.
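
As a quick illustrative check of the structure just described (this sketch is not part of the original analysis),
the data frame can be inspected directly:

str(MurderRates)   # 44 observations of 8 variables: 7 numeric plus the southern factor
head(MurderRates)  # first few observations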

(c)

Histograms of the Variables

par(mfrow=c(2,2))
hist(MurderRates$rate, main = "Histogram of Murder Rates",
xlab = "Murder Rates per 100,000", probability = TRUE)

lines(density(MurderRates$rate), col = "blue", lwd = 4)
hist(MurderRates$convictions, main = "Histogram of Convictions",
xlab = "Convictions in 1950", probability = TRUE)
lines(density(MurderRates$convictions), col = "blue", lwd= 4)
hist(MurderRates$executions, main = "Histogram of Executions",
xlab = "Executions during 1946-1950", probability = TRUE)
lines(density(MurderRates$executions), col = "blue", lwd = 4)
hist(MurderRates$time, main = "Histogram of Median Time of Incarceration",
xlab = "Median Time Served (in months) in 1950", probability = TRUE)
lines(density(MurderRates$time), col = "blue", lwd = 4)

[Figure: histograms with density curves for Murder Rates per 100,000, Convictions in 1950, Executions during 1946-1950, and Median Time Served (in months).]

hist(MurderRates$income, main = "Histogram of Median Family Income",
xlab = "Median Family Income in 1949 (in 1,000 USD)", probability = TRUE)
lines(density(MurderRates$income), col = "blue", lwd = 4)
hist(MurderRates$lfp, main = "Histogram of Labor Force Participation (LFP)",
xlab = "LFP (in percent) in 1950", probability = TRUE)
lines(density(MurderRates$lfp), col = "blue", lwd = 4)
hist(MurderRates$noncauc, main = "Histogram of the Proportion of
Population that is Non-Caucasian",
xlab = "Proportion of Population that is Non-Caucasian in 1950", probability = TRUE)
lines(density(MurderRates$noncauc), col = "blue", lwd = 4)
library(dplyr)

[Figure: histograms with density curves for Median Family Income in 1949 (in 1,000 USD), Labor Force Participation (in percent) in 1950, and the Proportion of the Population that is Non-Caucasian in 1950.]

count(MurderRates, southern)

## southern n
## 1 no 29
## 2 yes 15

Five-Number Summary and Boxplots of the Variables

summary(MurderRates$rate)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.810 1.808 3.625 5.404 7.725 19.250

summary(MurderRates$convictions)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.1080 0.1663 0.2260 0.2605 0.3202 0.7570

summary(MurderRates$executions)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.00000 0.02625 0.04500 0.06034 0.08225 0.40000

summary(MurderRates$time)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 34.0 94.0 124.0 136.5 179.0 298.0

summary(MurderRates$income)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.760 1.550 1.830 1.781 2.070 2.390

summary(MurderRates$lfp)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 47.00 51.50 53.40 53.07 54.52 58.80

summary(MurderRates$noncauc)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.00300 0.02175 0.06450 0.10559 0.14450 0.45400

summary(MurderRates$southern)

## no yes
## 29 15

par(mfrow=c(2,2))
boxplot(MurderRates$rate, main = "Boxplot of Murder Rates")
boxplot(MurderRates$convictions, main = "Boxplot of Convictions")
boxplot(MurderRates$executions, main = "Boxplot of Executions")
boxplot(MurderRates$time, main = "Boxplot of Median Time Served")

[Figure: boxplots of Murder Rates, Convictions, Executions, and Median Time Served.]

boxplot(MurderRates$income, main = "Boxplot of Median Family Income")


boxplot(MurderRates$lfp, main = "Boxplot of LFP Rate")
boxplot(MurderRates$noncauc, main = "Boxplot of Non-Caucasian")

[Figure: boxplots of Median Family Income, LFP Rate, and the Non-Caucasian proportion.]

Correlation Matrix of the Variables

library(corrplot)

## corrplot 0.92 loaded

library(ggplot2)

cornew <- MurderRates[, -8]


correlation <- cor(cornew)
correlation

## rate convictions executions time income


## rate 1.0000000 -0.251134671 0.17279303 -0.518584162 -0.65427652
## convictions -0.2511347 1.000000000 -0.21352746 0.004829982 0.06577502
## executions 0.1727930 -0.213527464 1.00000000 0.079556439 0.03783757
## time -0.5185842 0.004829982 0.07955644 1.000000000 0.30356188
## income -0.6542765 0.065775022 0.03783757 0.303561875 1.00000000
## lfp -0.1827364 -0.172351045 0.29625365 0.154922595 0.55790866
## noncauc 0.7486359 -0.184331213 0.22077998 -0.335964017 -0.66062442
## lfp noncauc
## rate -0.1827364 0.7486359
## convictions -0.1723510 -0.1843312

## executions 0.2962537 0.2207800
## time 0.1549226 -0.3359640
## income 0.5579087 -0.6606244
## lfp 1.0000000 -0.1507521
## noncauc -0.1507521 1.0000000

par(mfrow=c(1,1))
corrplot(correlation, method = "circle")

[Figure: correlation plot (method = "circle") of rate, convictions, executions, time, income, lfp, and noncauc.]

corrplot(correlation, method = "number")

[Figure: correlation plot (method = "number") showing the same correlation coefficients as the printed matrix above.]

Analysis: Based on the histograms and five-number summaries, all variables except labor force participation
and median family income are right-skewed, as their means exceed their medians; a transformation such as
a log could bring these distributions closer to normal. Median family income is left-skewed, since its mean
is smaller than its median. Labor force participation appears approximately normally distributed, with
similar mean and median values. The boxplots show that, except for median time served and the labor force
participation rate, the variables tend to have one to three outliers. The southern factor shows that
non-southern states (“no”) are more prevalent, making up 29/44 or about 65.9% of the observations, while
southern states (“yes”) account for the remaining 15/44 or about 34.1%.
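
As a hedged illustration of the transformation idea mentioned above (this sketch was not part of the
estimated models), the right-skewed variables could be log-transformed; because executions has a minimum
of zero, log1p() is used for that variable to avoid taking the log of zero:

par(mfrow = c(1, 2))
hist(MurderRates$rate, main = "Murder Rate", probability = TRUE)
hist(log(MurderRates$rate), main = "log(Murder Rate)", probability = TRUE)
# executions contains zeros, so log1p(x) = log(1 + x) avoids -Inf values
summary(log1p(MurderRates$executions))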

(d) Possible Violation of the Regression Assumptions

The outliers visible in the boxplots (1 in murder rate, 3 in convictions, 2 in executions, 2 in noncauc) and
the right-skewed histograms suggest non-normality and, potentially, heteroskedasticity. The functional form
may therefore be misspecified and may call for transformations such as logs or the addition of
squared/interaction terms. The correlation plot shows that no pairwise correlation falls below -0.8 or above
0.8, which indicates that the variables are not highly correlated with one another. Since high correlation
between explanatory variables could lead to unreliable models, this is a good sign.
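
The ±0.8 rule of thumb used above can also be checked programmatically; a small sketch of that check
(the cutoff itself is our own convention, not a formal test):

offdiag <- correlation[upper.tri(correlation)]   # each pairwise correlation counted once
any(abs(offdiag) > 0.8)   # expected FALSE: the largest |r| in the printed matrix is about 0.75
max(abs(offdiag))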

3. Estimate a multiple linear regression model that includes main effects only
(i.e. no interactions or higher order terms). This is our baseline model.

(a)

# Baseline model with main effects only
reg1 <- lm(rate ~ convictions + executions + time + income + lfp + noncauc + southern,
data = MurderRates)
summary(reg1)

##
## Call:
## lm(formula = rate ~ convictions + executions + time + income +
## lfp + noncauc + southern, data = MurderRates)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9913 -1.1943 -0.3538 1.2383 6.5574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44436 9.96694 0.045 0.9647
## convictions -4.33938 2.78313 -1.559 0.1277
## executions 2.85276 6.12313 0.466 0.6441
## time -0.01547 0.00705 -2.194 0.0348 *
## income -2.50013 1.68519 -1.484 0.1466
## lfp 0.19357 0.20614 0.939 0.3540
## noncauc 10.39903 5.40610 1.924 0.0623 .
## southernyes 3.26216 1.32980 2.453 0.0191 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.459 on 36 degrees of freedom
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.6965
## F-statistic: 15.1 on 7 and 36 DF, p-value: 5.105e-09

par(mfrow=c(2,2))
plot(reg1, main = "Residuals vs Fitted for Reg 1")

[Figure: diagnostic plots for reg1: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

# 5% critical value for the overall F-test with 7 and 36 degrees of freedom
Fc1 <- qf(0.95, 7, 36)
Fc1

## [1] 2.277143

Statistical Significance

Looking at the regression output, the intercept and the variables convictions, executions, income, lfp, and
noncauc are not statistically significant at the 5% level, as they all have p-values greater than 0.05. Only
two variables, time and southern, are statistically significant at the 5% level, with p-values below 0.05. The
majority of the explanatory variables may lack statistical significance partly because of the small sample
size of 44; as the sample size increases, the results may change.
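
An equivalent way to read these individual significance results (an illustrative check, not required output)
is through 95% confidence intervals: intervals that contain zero correspond to coefficients that are not
significant at the 5% level.

# 95% confidence intervals for the baseline model's coefficients
confint(reg1, level = 0.95)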

Unrealistic Signs and/or Magnitudes

Since the p-values of time and southern are below 0.05, median time served and the southern regional factor
are statistically significant at the 5% level. Given the small sample, there may simply be too few
observations to detect significance for the remaining variables. The estimates for convictions, income, time,
noncauc, and southern have plausible signs. With a higher conviction rate, we expect fewer murders because
of the greater risk of punishment and imprisonment. Our reasoning for time is that the longer the time
served, the less likely previously convicted murderers are to commit murder again. Wealthier populations,
which in 1950 were disproportionately Caucasian, are expected to have lower murder rates. Lastly, southern
states are likely to have higher murder rates due to gun culture and higher poverty. The estimates for labor
force participation and executions are surprising, however, because their coefficients are positive. We would
expect that people who are employed have less time to commit murder, and that a higher number of
executions per conviction raises the cost of committing murder, implying an inverse relationship. These
anomalies and unexpected magnitudes may stem from the small sample size of 44. The intercept of 0.44436
could be realistic, though the true value for the US is likely smaller.

(b)

Looking at the regression output, the R-squared is 0.7459, which is considerably high: the explanatory
variables jointly explain approximately 74.59 percent of the variation in the murder rate. This gives us
confidence that the model is a good start, but more tests need to be performed to better assess its fit. The
plot of the model's residuals is a small concern, as a few outliers make the variance look inconsistent. This
may reflect the point raised in 2(d) that heteroskedasticity may be present, as is common with
cross-sectional data. However, most of the residuals look well behaved, and the model's p-value is
approximately zero, which gives us confidence in the model as constructed. In terms of the overall
significance of the model, since the F-statistic in the regression output (15.1) is greater than Fc1, we reject
the null hypothesis and conclude that the explanatory variables are jointly significant for the murder rate.
The model's p-value, which is below 0.05, leads to the same conclusion at the 5% significance level.
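
The overall-significance conclusion can also be verified directly from the F distribution; a short sketch
using the reported statistic and the critical value computed above:

# P-value for the overall F-test: F = 15.1 with 7 and 36 degrees of freedom
pf(15.1, df1 = 7, df2 = 36, lower.tail = FALSE)   # approximately 5e-09, consistent with the output
15.1 > Fc1                                        # TRUE: reject the null of no joint significance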

FEATURE SELECTION

4. Test the model in (3) for multicollinearity using VIF. Based on this test
remove the appropriate variables and estimate a new regression model based on
these findings. Be sure to justify your reason/criteria for removal.

library(broom)
library(knitr)
library(car)
vif1 <- vif(reg1)
vif1

## convictions executions time income lfp noncauc


## 1.106062 1.255795 1.341514 3.168240 1.858710 2.700726
## southern
## 2.891300

Based on the results above, all of the explanatory variables have VIFs below 5. This is a good sign that the
explanatory variables are not strongly correlated with, and do not have substantial explanatory power for,
one another. Thus, none of the variables need to be removed.
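
For intuition (an illustrative sketch only), each VIF equals 1/(1 - R^2) from an auxiliary regression of one
explanatory variable on all of the others; for example, the VIF for income reported above can be reproduced
by hand:

# Auxiliary regression: income on the remaining explanatory variables
aux <- lm(income ~ convictions + executions + time + lfp + noncauc + southern,
data = MurderRates)
1 / (1 - summary(aux)$r.squared)   # matches vif1["income"], about 3.17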

5. Using AIC or Schwartz Criterion, determine which subset of predictors you
will keep and generate a new model. Comment on the performance of this model
compared to the one in (3).

beststep <- step(reg1, direction ="backward")

## Start: AIC=86.35
## rate ~ convictions + executions + time + income + lfp + noncauc +

## southern
##
## Df Sum of Sq RSS AIC
## - executions 1 1.312 218.99 84.613
## - lfp 1 5.331 223.01 85.413
## <none> 217.68 86.348
## - income 1 13.309 230.99 86.960
## - convictions 1 14.699 232.38 87.224
## - noncauc 1 22.373 240.05 88.653
## - time 1 29.113 246.79 89.872
## - southern 1 36.388 254.07 91.150
##
## Step: AIC=84.61
## rate ~ convictions + time + income + lfp + noncauc + southern
##
## Df Sum of Sq RSS AIC
## - lfp 1 7.006 226.00 83.999
## <none> 218.99 84.613
## - income 1 12.666 231.66 85.087
## - convictions 1 16.056 235.05 85.726
## - noncauc 1 25.037 244.03 87.376
## - time 1 27.856 246.85 87.881
## - southern 1 38.946 257.94 89.815
##
## Step: AIC=84
## rate ~ convictions + time + income + noncauc + southern
##
## Df Sum of Sq RSS AIC
## - income 1 6.369 232.37 83.222
## <none> 226.00 83.999
## - convictions 1 21.255 247.25 85.954
## - time 1 28.030 254.03 87.143
## - southern 1 35.330 261.33 88.390
## - noncauc 1 39.586 265.58 89.101
##
## Step: AIC=83.22
## rate ~ convictions + time + noncauc + southern
##
## Df Sum of Sq RSS AIC
## <none> 232.37 83.222
## - convictions 1 20.420 252.79 84.928
## - time 1 26.694 259.06 86.006
## - noncauc 1 57.408 289.77 90.936
## - southern 1 60.181 292.55 91.355

summary(beststep)

##
## Call:
## lm(formula = rate ~ convictions + time + noncauc + southern,
## data = MurderRates)
##
## Residuals:
## Min 1Q Median 3Q Max

## -4.5200 -1.5375 -0.4578 1.4669 6.6747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.885154 1.434504 4.103 0.000201 ***
## convictions -4.974470 2.687056 -1.851 0.071714 .
## time -0.014578 0.006887 -2.117 0.040723 *
## noncauc 14.520544 4.677909 3.104 0.003546 **
## southernyes 3.729056 1.173343 3.178 0.002899 **
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.441 on 39 degrees of freedom
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7009
## F-statistic: 26.2 on 4 and 39 DF, p-value: 1.357e-10

Based on the backward selection above and the extraction of the lowest-AIC model (AIC = 83.22), we
conclude that the better model keeps the explanatory variables convictions, time, noncauc, and southern.
All of the variables in the new model except convictions are statistically significant at the 5% level, as
shown by their p-values being less than 0.05; convictions is statistically significant at the 10% level.
Compared to the baseline model, in which the majority of the parameter estimates were not statistically
significant, the new model has all of its parameter estimates significant at the 5% level except one. This
suggests that the new model performs better than the baseline model.
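
Because beststep is nested within reg1, a partial F-test offers a complementary check (a supplementary
sketch, not part of the assignment's required output) of whether dropping executions, income, and lfp
significantly worsens the fit:

# Partial F-test of the reduced model (beststep) against the baseline model (reg1)
anova(beststep, reg1)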

6. Using the model in (5), plot the residuals versus the fitted values, ŷ, and
comment on your results.

residualPlot(beststep, main = "Residual Plot for Beststep Model")

[Figure: Pearson residuals versus fitted values for the beststep model.]

Based on the residual plot, the model created in (5) appears to show some heteroskedasticity: the spread of
the residuals is not constant and tends to increase at higher fitted values.
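
The visual impression of unequal variance can be supplemented with a formal test; as a hedged sketch, a
Breusch-Pagan test (bptest() from the lmtest package, which loads with AER) could be applied to the same
model:

# Breusch-Pagan test: the null hypothesis is homoskedasticity
library(lmtest)
bptest(beststep)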

7. Perform a RESET test on the model in (5) and comment on the results.

resettest(beststep, power=2, type="fitted")

##
## RESET test
##
## data: beststep
## RESET = 1.8453, df1 = 1, df2 = 38, p-value = 0.1823

Based on the results of the RESET test, the p-values of 0.1823 and 0.2817 (the latter from an additional
specification not shown in this output) are both greater than 0.05. Thus, we fail to reject the null at the 5%
significance level. The squared fitted-value term is not statistically significant, so squared and interaction
terms do not improve the model and should not be included.
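
The additional specification referenced above does not appear in the knitted output; as a hedged guess, it
may have been one of the following RESET variants (squared and cubed powers of the fitted values, or
powers of the regressors):

# RESET with squared and cubed fitted values
resettest(beststep, power = 2:3, type = "fitted")
# RESET using powers of the regressors instead of the fitted values
resettest(beststep, power = 2, type = "regressor")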

8. Using the appropriate method learnt in class, test the model in (4) for
heteroskedasticity and comment on the conclusion. If it is present, correct the
model before moving on. Based on the results in (c) or (d), this might be helpful
in transforming the model in the event that its functional form presents an issue.

GQ test

gqtest(beststep, point=0.5, alternative="greater")

##
## Goldfeld-Quandt test
##
## data: beststep
## GQ = 1.0132, df1 = 17, df2 = 17, p-value = 0.4894
## alternative hypothesis: variance increases from segment 1 to 2

Based on the GQ test, since the p-value of 0.4894 is greater than 0.05, we fail to reject the null hypothesis of
homoskedasticity at the 5% significance level. There is insufficient evidence of heteroskedasticity, so no
correction is needed.
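
Had the GQ test instead indicated heteroskedasticity, one standard correction would be
heteroskedasticity-robust (White/HC) standard errors; a sketch of that correction using coeftest() with
vcovHC() from the sandwich package (also loaded with AER), shown only to document what we would have
done:

# Robust (HC1) standard errors for the reduced model
library(lmtest)
library(sandwich)
coeftest(beststep, vcov. = vcovHC(beststep, type = "HC1"))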

9. Using a combination of the results from the previous steps, estimate a model
based on your findings which includes interaction terms or higher power terms
(if necessary). You may need to use forward or backward selection for this.
Comment on the performance of this model compared to your other models.
Make sure to use AIC and Schwartz criterion for model comparison.

# AIC and BIC test
kable(glance(reg1), digits = 2)

r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC     BIC     deviance  df.residual  nobs
0.75       0.7            2.46   15.1       0        7   -97.61  213.22  229.27  217.68    36           44

kable(glance(beststep), digits=2)

r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC     BIC     deviance  df.residual  nobs
0.73       0.7            2.44   26.2       0        4   -99.04  210.09  220.79  232.37    39           44

Since we failed to reject the null hypothesis for the RESET test in Q7, we do not add squared or interaction
terms. Since we failed to reject the null hypothesis for the GQ test in Q8, we do not need to correct for
heteroskedasticity. Based on the AIC and Schwartz criterion comparison between our original regression
reg1 and the new regression beststep, both the AIC and the BIC decrease for the new regression. This tells
us that the new regression model is a better fit.
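
The same comparison can be made directly with the built-in extractors (an equivalent check of the kable
output above; note that step() reports AIC on a different additive scale than AIC()):

AIC(reg1, beststep)   # lower for beststep, matching the AIC column above
BIC(reg1, beststep)   # lower for beststep, matching the BIC column above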

10. Provide a short 1 paragraph summary of your overall conclusion, findings,
and limitations not previously stated above.

The question we initially set out to answer was: do the explanatory variables (convictions, executions, time,
income, lfp, noncauc, and southern) have a statistically significant effect on the murder rate per 100,000?
After refining the model with the step() function, we kept only the variables convictions, time, noncauc, and
southern. We did not add any interaction or squared terms, based on the RESET test results, and the VIF
results indicated that no correction for multicollinearity was needed. The adjusted R-squared increased and
the AIC and BIC both decreased, indicating that the new model fits the data better. However, there are
limitations. The sample size of 44 is small, which makes our inferences susceptible to sampling variation and
incorrect generalizations. Moreover, the histograms indicate non-normality for most of the variables, and we
were unable to apply a log transformation to some of them because zero values (for example, in executions)
produce undefined values in R. The non-normality and the handful of outliers in each variable could reduce
the accuracy of our inferences, and the residual plot still shows some indication of heteroskedasticity. These
limitations could potentially be addressed with a larger sample size.

