Variable Datasets
Andrew Flores, Jeremy Flores, Shannon Park, JunGyo Kim
2023-11-29
library(AER)
library(plm)
library(gplots)
library(lmtest)
data("Municipalities")
(a) Briefly discuss your data and the question you are trying to answer with your model.
is.pbalanced(Municipalities)
## [1] TRUE
which(is.na(Municipalities))
## integer(0)
## Balanced Panel: n = 9, T = 9, N = 81
We use the Municipal Expenditure Data, a panel data set for 265 Swedish municipalities covering 9 years
(1979-1987), with 2,385 observations on 5 variables. Question: how do revenues and grants (in million SEK)
affect total expenditures once differences across municipalities and years are accounted for? We selected
9 municipalities to match the 9 periods, giving n = 9, T = 9, and N = 81 observations. The panel is balanced:
every municipality is observed in every year, so each unit has the same number of observations. Because the
same municipalities are followed over time, this is panel (longitudinal) data rather than a repeated
cross-section, which would sample different units in each period.
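The code that builds munic_pd is not shown in this report. A minimal sketch of one way to obtain a balanced 9 x 9 panel follows. Taking the first nine municipality IDs is an assumption (the report does not say which nine were chosen), and the name munic_sub is hypothetical.

```r
library(AER)  # provides the Municipalities data
library(plm)  # pdata.frame(), pdim(), is.pbalanced()

data("Municipalities", package = "AER")

# Assumption: keep the first nine municipality IDs; the report's own choice is not stated
ids <- unique(Municipalities$municipality)[1:9]
munic_sub <- droplevels(subset(Municipalities, municipality %in% ids))

# Declare the panel structure: unit index first, time index second
munic_pd <- pdata.frame(munic_sub, index = c("municipality", "year"))

# Report the panel dimensions (n units, T periods, N observations)
print(pdim(munic_pd))
```

Any nine municipalities give the same dimensions because the full data set is itself balanced, but the estimates later in the report depend on which nine are used.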
Data Citation:
Dahlberg, M., and Johansson, E. (2000). An Examination of the Dynamic Behavior of Local Governments
Using GMM Bootstrapping Methods. Journal of Applied Econometrics, 15, 401–416.
Greene, W.H. (2003). Econometric Analysis, 5th edition. Upper Saddle River, NJ: Prentice Hall.
(b) Provide a descriptive analysis of your variables. This should include relevant figures with comments
including some graphical depiction of individual heterogeneity. Ensure that all figures (or statistics)
include relevant comments about the nature of the data as indicated by the figure (e.g., whether data
is skewed; approx. normal; constant across time/unit, etc.).
head(munic_pd)
[Figure: Histogram of Expenditures (y-axis: Density)]
[Figure: Histogram of Revenues (y-axis: Density)]
[Figure: Histogram of Grants (y-axis: Density)]
library(tseries)
library(forecast)
adf.test(munic_pd[,1])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 1]
## Dickey-Fuller = -7.9299, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
adf.test(munic_pd[,2])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 2]
## Dickey-Fuller = -7.9299, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
adf.test(munic_pd[,3])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 3]
## Dickey-Fuller = -3.434, Lag order = 4, p-value = 0.05584
## alternative hypothesis: stationary
adf.test(munic_pd[,4])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 4]
## Dickey-Fuller = -3.45, Lag order = 4, p-value = 0.05324
## alternative hypothesis: stationary
adf.test(munic_pd[,5])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 5]
## Dickey-Fuller = -3.4353, Lag order = 4, p-value = 0.05563
## alternative hypothesis: stationary
[Figure: scatterplot of expenditures by year; labeled points 35 and 36]
## [1] "35" "36"
[Figure: scatterplot by municipality; labeled points 67 and 76]
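The histograms for expenditures and revenues appear right-skewed, and the histogram for grants appears
relatively normally distributed. The Augmented Dickey-Fuller tests for expenditures, revenues, and grants
(columns 3 to 5) give p-values of 0.05584, 0.05324, and 0.05563. Each is marginally above 0.05, so at the 5%
level we fail to reject the null of a unit root and cannot conclude that any of the three series is stationary;
at the 10% level all three would be judged stationary. Non-stationarity would mean that the mean, variance,
and covariance change over time, and differencing may be needed to correct for it. These tests treat the 81
stacked observations as a single time series, so the results are only indicative for a panel. The tests on
columns 1 and 2 appear to be run on the municipality and year index columns and are not interpreted here.
Heterogeneity is visible in both scatterplots, shown by differences in variance across years and across
municipalities. The scatterplots show that the data points are not constant across time or units.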
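If differencing is used, it has to be done within each municipality so that no difference is taken across the boundary between two units. A minimal base R sketch of that step follows, shown on the full Municipalities data. The object name m and the column name d_exp are hypothetical.

```r
library(AER)  # provides the Municipalities data

data("Municipalities", package = "AER")

# Sort by unit, then by time, so that diff() compares consecutive years
m <- Municipalities[order(Municipalities$municipality, Municipalities$year), ]

# First difference of expenditures within each municipality;
# the first year of every unit has no predecessor and is set to NA
m$d_exp <- ave(m$expenditures, m$municipality,
               FUN = function(v) c(NA, diff(v)))

head(m[, c("municipality", "year", "expenditures", "d_exp")], 10)
```

Each municipality loses its first year, so a 9-year panel keeps 8 differenced observations per unit.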
(c) Fit the three models below, and identify which model is your preferred one and why. Make sure to
include your statistical diagnostics to support your conclusion, and to comment on your findings.
• Pooled Model
• Fixed Effects
• Random Effects
1. For statistical diagnoses, provide appropriate tests and their conclusions and indicate whether the tests
are across individual, time or both depending on the fixed effect model used.
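fe_twoways <- plm(expenditures~revenues+grants,
data = munic_pd, model = "within", effect = "twoways")
fe_time <- plm(expenditures~revenues+grants, data = munic_pd,
model = "within", effect = "time")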
fe_individuals <- plm(expenditures~revenues+grants,
data = munic_pd, model = "within", effect = "individual")
fe_no <- plm(expenditures~revenues+grants, data = munic_pd, model = "pooling")
re_no <- plm(expenditures~revenues+grants, data = munic_pd, model = "random")
pFtest(fe_twoways, fe_no)
##
## F test for twoways effects
##
## data: expenditures ~ revenues + grants
## F = 1.2498, df1 = 16, df2 = 62, p-value = 0.2584
## alternative hypothesis: significant effects
pFtest(fe_time, fe_no)
##
## F test for time effects
##
## data: expenditures ~ revenues + grants
## F = 2.2666, df1 = 8, df2 = 70, p-value = 0.03232
## alternative hypothesis: significant effects
pFtest(fe_individuals, fe_no)
##
## F test for individual effects
##
## data: expenditures ~ revenues + grants
## F = 0.3284, df1 = 8, df2 = 70, p-value = 0.9524
## alternative hypothesis: significant effects
##
## Lagrange Multiplier Test - (Honda)
##
## data: expenditures ~ revenues + grants
## normal = -1.7623, p-value = 0.961
## alternative hypothesis: significant effects
phtest(fe_time, re_no)
##
## Hausman Test
##
## data: expenditures ~ revenues + grants
## chisq = 0.34732, df = 2, p-value = 0.8406
## alternative hypothesis: one model is inconsistent
2. Including plotmeans (or scatterplot with appropriate inputs) diagrams to informally depict
individual/time/twoway heterogeneity.
[Figure: plot of means by municipality]
[Figure: plot of means of expenditures by year]
3. For the pooled model, do the bptest and comment on the existence of Heteroskedasticity and correct
for it.
##
## studentized Breusch-Pagan test
##
## data: pooled
## BP = 1.493, df = 2, p-value = 0.474
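#1. STATISTICAL DIAGNOSIS The F tests compare each fixed effects model with the pooled model. For the
two-way effects test (p = 0.2584) and the individual effects test (p = 0.9524), the p-value is greater than
0.05, so we fail to reject the null and conclude that there are no significant two-way or individual effects.
For the time effects test (p = 0.03232), the p-value is less than 0.05, so we reject the null and conclude that
there are significant time effects. This means the only detectable differences are across time. The LM
(Honda) test gives p = 0.961, so we fail to reject the null of no significant effects; it provides no evidence
of random effects relative to the pooled model.
The Hausman test (p = 0.8406) fails to reject the null that the random effects estimator is consistent, so
there is no evidence of endogeneity and the Random Effects Model is preferred to the fixed effects model.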
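As a check on the decisions above, the p-values can be recomputed from the reported test statistics and degrees of freedom using base R distribution functions. The sketch below only re-derives numbers already printed in the output; the names p_values and decision are hypothetical. With n = 9, T = 9, N = 81 and two regressors plus an intercept, the two-way F test has df1 = 8 + 8 = 16 and df2 = 81 - 16 - 3 = 62, and each one-way test has df1 = 8 and df2 = 81 - 8 - 3 = 70.

```r
alpha <- 0.05

# Upper-tail probabilities for the statistics reported in the output
p_values <- c(
  f_twoways    = pf(1.2498, df1 = 16, df2 = 62, lower.tail = FALSE),
  f_time       = pf(2.2666, df1 = 8,  df2 = 70, lower.tail = FALSE),
  f_individual = pf(0.3284, df1 = 8,  df2 = 70, lower.tail = FALSE),
  lm_honda     = pnorm(-1.7623, lower.tail = FALSE),  # one-sided, upper tail
  hausman      = pchisq(0.34732, df = 2, lower.tail = FALSE)
)

# Decision rule at the 5% level
decision <- ifelse(p_values < alpha, "reject H0", "fail to reject H0")
print(data.frame(p_value = round(p_values, 4), decision))
```

Only the time effects F test rejects its null at the 5% level, which matches the reading given above.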
#2. HETEROGENEITY (WITHIN INDIVIDUALS) Through the plotmeans, we can see that the data exhibit
heterogeneity across municipalities and across time. We selected the random effects model. Our justification
is that a non-zero correlation between the errors for the same municipality at different points in time
implies an individual-specific random error.
Through the LM test for heterogeneity, we fail to reject the null (p = 0.961). This means the test does not
detect individual random differences in the variance of the error term, so the evidence for random effects
over the pooled model is weak.
Because the F test finds significant time effects, a fixed effects model with time effects can also be used.
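#3. HETEROSKEDASTICITY The studentized Breusch-Pagan test on the pooled model gives BP = 1.493 with
p = 0.474, so we fail to reject the null of homoskedasticity and conclude that there is insufficient evidence
of heteroskedasticity. No correction is therefore needed. Had the test rejected, we would have reported
cluster-robust (panel-robust) standard errors, which allow for heteroskedasticity and for correlation between
observations on the same individual across time.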
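For reference, a sketch of how the correction could be applied if the Breusch-Pagan test had rejected. It rebuilds a 9-municipality panel under the assumption that the first nine IDs are used (the report does not state which were chosen), so the numbers it prints need not match the output above. The names munic_sub and robust are hypothetical.

```r
library(AER)  # Municipalities data; also attaches lmtest and sandwich
library(plm)

data("Municipalities", package = "AER")

# Assumption: first nine municipality IDs
ids <- unique(Municipalities$municipality)[1:9]
munic_sub <- droplevels(subset(Municipalities, municipality %in% ids))
munic_pd <- pdata.frame(munic_sub, index = c("municipality", "year"))

pooled <- plm(expenditures ~ revenues + grants, data = munic_pd,
              model = "pooling")

# Studentized Breusch-Pagan test on the pooled specification
print(bptest(expenditures ~ revenues + grants, data = munic_sub,
             studentize = TRUE))

# Coefficient table with standard errors clustered by municipality
robust <- coeftest(pooled,
                   vcov. = vcovHC(pooled, method = "arellano",
                                  type = "HC1", cluster = "group"))
print(robust)
```

With only nine clusters, cluster-robust standard errors can themselves be imprecise, so they are best read alongside the conventional ones.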
(a) Briefly discuss your data and the question you are trying to answer with your model. The relationship
you are trying to describe with your regression model should be clearly stated. Must include data
citation (where is it from?) What type of variables are in your model; categorical, indicator, continuous
and how many of each? What is the dependent variable?
library(AER)
data("SwissLabor")
which(is.na(SwissLabor))
## integer(0)
Do income (log of nonlabor income), age (years divided by 10), education (years), the number of young
children under 7, the number of older children over 7, and whether the individual is non-Swiss (foreign)
influence an individual's participation in the Swiss labor force? “participation” and “foreign” are indicator
variables (2), “youngkids” and “oldkids” are discrete counts (2), and “income”, “age”, and “education” are
treated as continuous (3). “participation” is our dependent variable.
Citation: Gerfin, M. (1996). Parametric and Semi-Parametric Estimation of the Binary Response Model of
Labour Market Participation. Journal of Applied Econometrics, 11, 321–339.
(b) Provide a descriptive analysis of your variables. This should include RELEVANT histograms and
fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-
number summary). All figures must include comments. For binary variables, you can simply include
the proportions of each factor. Ensure that all figures (or statistics) include relevant comments about
the nature of the data as indicated by the figure (e.g., whether data is skewed; approx. normal; constant
across time/unit, etc.).
summary(SwissLabor$participation)
## no yes
## 471 401
summary(SwissLabor$income)
summary(SwissLabor$age)
summary(SwissLabor$education)
summary(SwissLabor$youngkids)
summary(SwissLabor$oldkids)
summary(SwissLabor$foreign)
## no yes
## 656 216
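library(corrplot)
library(ggplot2)
# Drop column 1 (participation), then the last remaining column (foreign),
# so that cor() receives only the five numeric variables
cornew <- SwissLabor[,-1]
cornewnew <- cornew[,-6]
correlation <- cor(cornewnew)
par(mfrow=c(1,1))
corrplot(correlation, method = "circle")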
[Figure: correlation plot (circles) of income, age, education, youngkids, and oldkids; color scale from −1 to 1]
[Figure: second correlation plot of the same five variables; color scale from −1 to 1]
[Figure: Histogram of Income (y-axis: Density; x-axis from 7 to 12)]
[Figure: Histogram of Age (y-axis: Density; x-axis from 2 to 6)]
[Figure: Histogram of Education (y-axis: Density; x-axis from 0 to 20)]
[Figure: Histogram of Young Kids (y-axis: Density)]
[Figure: Histogram of Old Kids (y-axis: Density; x-axis from 0 to 6)]
[Figure: Boxplot of Log of Nonlabor Income (axis from 7 to 12)]
[Figure: Boxplot of Age in Decades (axis from 2 to 6)]
[Figure: Boxplot of Years of Education (axis from 5 to 20)]
[Figure: Boxplot of Number of Children Under 7 Years Old (axis from 0 to 3)]
[Figure: Boxplot of Number of Children Over 7 Years Old (axis from 0 to 6)]
The correlation plot shows that no pairwise correlation is below -0.8 or above 0.8, which indicates that
the variables are not highly correlated with each other. High correlation between regressors
(multicollinearity) would inflate standard errors and make the coefficient estimates unstable, so this is a
favorable result. The proportions for the participation variable show that the group “no”, meaning the
individual did not participate in the labor force, is more prevalent, making up 471/872 or 54%, while the
group “yes” accounts for 401/872 or 46%. For the foreign variable, the “no” group is also more prevalent
and accounts for 656/872 or 75%, whereas the “yes” group is 216/872 or 25%.
Since income is already expressed as a logarithm, its histogram appears approximately normal. The histogram
of age appears to be quite normally distributed as well. The histogram of education, upon visual inspection,
may be slightly left-skewed but is otherwise roughly normal. The histogram of young kids is interesting, as it
only takes the values 0, 1, 2, or 3, and is heavily right-skewed. The histogram of old kids is also
right-skewed.
The boxplot of the log of nonlabor income has several outliers both above and below the whiskers. The
boxplot of years of education has 2 outliers above the whiskers and 1 outlier below. The boxplot for children
under 7 years of age is very flat and near zero, as most families have zero children that meet that criterion;
however, there are a few outliers. The boxplot for children over 7 years of age has an outlier at 6, and the
interquartile range is from 0 to 2, with a median of 1. Since our dependent variable is binary, a scatterplot
against it is not informative.
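The proportions quoted above follow directly from the counts printed by summary(). A short base R sketch of the arithmetic follows. The counts are copied from the output in this report, and the helper name pct is hypothetical.

```r
# Counts copied from the summary() output for the two factor variables
participation <- c(no = 471, yes = 401)
foreign       <- c(no = 656, yes = 216)

# Percentage share of each level, rounded to whole percent
pct <- function(counts) round(100 * counts / sum(counts))

print(pct(participation))
print(pct(foreign))
```

On the data frame itself, the same shares can be obtained with prop.table(table(SwissLabor$participation)).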
(c) Fit the three models below, and identify which model is your preferred one and why. Make sure
to include statistical diagnostics to support your conclusion, and to comment on your findings.
• Linear Probability Model
• Probit Model
• Logit Model
Based on the final model chosen, be sure to include relevant marginal effects (e.g., AME or MEM or at an
important representative point). Using a hypothetical threshold of 0.5, which model best describes given
data using predictive performance? You may use other metrics for comparisons such as likelihood ratios.
library(margins)
predict(swisslab.lpm, data.frame(income=10,age=5.0,education=8,youngkids=0,
oldkids=1,foreign="no"), type="response")
## 1
## 1.469198
margins(swisslab.lpm)
margins(swisslab.probit)
margins(swisslab.logit)
library(tidyverse)
AIC(swisslab.lpm)
## [1] NA
AIC(swisslab.logit)
## [1] 1066.798
AIC(swisslab.probit)
## [1] 1066.983
BIC(swisslab.lpm)
## [1] NA
BIC(swisslab.logit)
## [1] 1100.193
BIC(swisslab.probit)
## [1] 1100.378
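Based on the AIC and BIC, the logit model is preferred: it has the lowest AIC (1066.798 versus 1066.983 for
probit) and the lowest BIC (1100.193 versus 1100.378), although the differences are very small. Both criteria
returned NA for the linear probability model, so it could not be compared on this basis.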
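The model-fitting code is not shown in this report, so the following is a sketch of how the three models and the 0.5-threshold comparison could be set up, not the specification actually used. The formula with all six regressors is an assumption, and the names dat, work, accuracy, conf_logit, and ame_income are hypothetical. The response is coded explicitly as 0/1. The NA values for the LPM and the predicted value of 1.469 above would be consistent with a factor response having been passed to lm(), which shifts fitted values by one, but that is a guess.

```r
library(AER)  # provides the SwissLabor data

data("SwissLabor", package = "AER")
dat <- SwissLabor
dat$work <- as.numeric(dat$participation == "yes")  # explicit 0/1 response

# Assumed specification: all six regressors entered linearly
f <- work ~ income + age + education + youngkids + oldkids + foreign

lpm    <- lm(f, data = dat)
probit <- glm(f, data = dat, family = binomial(link = "probit"))
logit  <- glm(f, data = dat, family = binomial(link = "logit"))

# Share of observations classified correctly at a 0.5 threshold
accuracy <- function(model) mean((fitted(model) > 0.5) == (dat$work == 1))
acc <- sapply(list(lpm = lpm, probit = probit, logit = logit), accuracy)
print(round(acc, 3))

# Confusion matrix for the logit model (rows: observed, columns: predicted)
conf_logit <- table(observed  = dat$work,
                    predicted = as.numeric(fitted(logit) > 0.5))
print(conf_logit)

# Information criteria for the two likelihood-based binary models
print(AIC(probit, logit))
print(BIC(probit, logit))

# Predicted probability at the representative point used in the report
point <- data.frame(income = 10, age = 5.0, education = 8,
                    youngkids = 0, oldkids = 1, foreign = "no")
print(predict(logit, newdata = point, type = "response"))

# Average marginal effect of income in the logit model:
# mean logistic density at the linear predictor times the coefficient
ame_income <- mean(dlogis(predict(logit, type = "link"))) * coef(logit)[["income"]]
print(ame_income)
```

The accuracy figures give the predictive comparison asked for in the prompt. The AIC and BIC are only comparable between probit and logit, since the LPM is fitted by least squares under a different likelihood.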