Econ 104 Proj 3

Panel Datasets and Binary Dependent Variable Datasets
Andrew Flores, Jeremy Flores, Shannon Park, JunGyo Kim
2023-11-29
library(AER)
library(plm)
library(gplots)
library(lmtest)
data("Municipalities")
1. Panel Data Model
(a) Briefly discuss your data and the question you are trying to answer with your model.
is.pbalanced(Municipalities)
## [1] TRUE
which(is.na(Municipalities))
## integer(0)
# Keep nine municipalities so that n = T = 9
Munic <- subset(Municipalities,
                municipality %in% c("114", "115", "120", "123", "125",
                                    "126", "136", "138", "139"))
munic_pd<-pdata.frame(Munic, index = c("municipality", "year"))
pdim(munic_pd)
## Balanced Panel: n = 9, T = 9, N = 81
We use the Municipal Expenditure Data, a panel data set for 265 Swedish municipalities covering 9 years
(1979-1987), with 2,385 observations on 5 variables. Question: What is the effect of revenues and grants
(in million SEK) on total expenditures, after accounting for differences across municipalities and years?
We selected 9 municipalities to match the 9 periods, giving n = 9, T = 9, and N = 81. The panel is balanced
because every municipality is observed in every year, so we have the same units and the same number of
observations per unit over time. Our dataset is panel (longitudinal) data rather than repeated
cross-sections, because the same municipalities are followed across several points in time.
Data Citation:
Dahlberg, M., and Johansson, E. (2000). An Examination of the Dynamic Behavior of Local Governments
Using GMM Bootstrapping Methods. Journal of Applied Econometrics, 15, 401–416.
Greene, W.H. (2003). Econometric Analysis, 5th edition. Upper Saddle River, NJ: Prentice Hall.
(b) Provide a descriptive analysis of your variables. This should include relevant figures with comments
including some graphical depiction of individual heterogeneity. Ensure that all figures (or statistics)
include relevant comments about the nature of the data as indicated by the figure (e.g., whether data
is skewed; approx normal; constant across time/unit, etc.).
head(munic_pd)
## municipality year expenditures revenues grants

## 114-1979 114 1979 0.0229736 0.0181770 0.0054429
## 114-1980 114 1980 0.0266307 0.0209142 0.0057304
## 114-1981 114 1981 0.0273253 0.0210836 0.0056647
## 114-1982 114 1982 0.0288704 0.0234310 0.0058859
## 114-1983 114 1983 0.0226474 0.0179979 0.0055908
## 114-1984 114 1984 0.0215601 0.0179949 0.0047536
# hist() has no data argument, so the column is passed directly
hist(Munic$expenditures, main = "Histogram of Expenditures",
     xlab = "Total Expenditures in million SEK", freq = FALSE)
lines(density(Munic$expenditures), col = "blue", lwd = 3)
[Figure: Histogram of Expenditures with density curve. x-axis: Total Expenditures in million SEK; y-axis: Density]
hist(Munic$revenues, main = "Histogram of Revenues",
     xlab = "Total Own-Source Revenues in million SEK", freq = FALSE)
[Figure: Histogram of Revenues. x-axis: Total Own-Source Revenues in million SEK; y-axis: Density]
hist(Munic$grants, main = "Histogram of Grants",
     xlab = "Total Intergovernmental Grants received by Municipality in million SEK",
     freq = FALSE)
[Figure: Histogram of Grants. x-axis: Total Intergovernmental Grants received by Municipality in million SEK; y-axis: Density]
library(tseries)
library(forecast)
adf.test(munic_pd[,1])
##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 1]
## Dickey-Fuller = -7.9299, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
##
scatterplot(expenditures ~year|municipality, data= munic_pd)
[Figure: Scatterplot of expenditures against year, grouped by municipality. x-axis: year (1979 to 1987); y-axis: expenditures]
## [1] "35" "36"
scatterplot(expenditures ~municipality|year, data= munic_pd)

[Figure: Scatterplot of expenditures against municipality, grouped by year. x-axis: municipality (114, 115, 120, 123, 125, 126, 136, 138, 139); y-axis: expenditures]
## [1] "67" "76"
The histograms for expenditures and revenues appear right skewed, and the histogram for grants appears
approximately normally distributed. Based on the Augmented Dickey-Fuller tests for expenditures, revenues,
and grants, we fail to reject the null hypothesis of a unit root at the 5% level for all three variables,
because their p-values are above 0.05, although only slightly. (The single test output displayed above is
for column 1 of munic_pd, which is the municipality index, not one of these three variables.) This suggests
that the mean, variance, and autocovariances may vary with time, and differencing may be required to correct
for non-stationarity. Heterogeneity across municipalities is visible in the first and second scatterplots,
shown by the differences in level and spread. The scatterplots also show that expenditures are not constant
across time or units.
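The stationarity discussion refers to tests on expenditures, revenues, and grants. A minimal sketch of how those
three tests could be run by variable name rather than by column position is given below. It assumes the same nine
municipalities as above and stacks them into one series per variable, which is a simplification; the names `ids`,
`vars`, and `adf_p` are introduced here for illustration only. A panel unit-root test such as plm's `purtest()`
may be more appropriate, but with only nine periods per municipality it may not be feasible.

```r
library(AER)      # provides the Municipalities data
library(tseries)  # provides adf.test()

data("Municipalities")

# Assumption: same nine municipalities as in the report
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
Munic <- subset(Municipalities, municipality %in% ids)

# Run the ADF test on each variable, selected by name
vars <- c("expenditures", "revenues", "grants")
adf_p <- sapply(vars, function(v) {
  # adf.test() warns when the p-value falls outside its table; silence that here
  suppressWarnings(adf.test(Munic[[v]])$p.value)
})

# p-values above 0.05 mean the unit-root null is not rejected at the 5% level
print(round(adf_p, 4))
```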
(c) Fit the three models below, and identify which model is your preferred one and why. Make sure to
include your statistical diagnostics to support your conclusion, and to comment on your findings. •
Pooled Model • Fixed Effects • Random Effects
#1.For statistical diagnoses, provide appropriate tests and their conclusions and indicate whether the tests
are across individual, time or both depending on the fixed effect model used.
# Fixed effects (within) models with two-way, time, and individual effects
fe_twoways <- plm(expenditures ~ revenues + grants, data = munic_pd,
                  model = "within", effect = "twoways")
fe_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "within", effect = "time")
fe_individuals <- plm(expenditures ~ revenues + grants, data = munic_pd,
                      model = "within", effect = "individual")
# Pooled OLS and random effects (default: individual effects)
fe_no <- plm(expenditures ~ revenues + grants, data = munic_pd, model = "pooling")
re_no <- plm(expenditures ~ revenues + grants, data = munic_pd, model = "random")
pFtest(fe_twoways, fe_no)
##
## F test for twoways effects
##
## data: expenditures ~ revenues + grants
## F = 1.2498, df1 = 16, df2 = 62, p-value = 0.2584
## alternative hypothesis: significant effects
pFtest(fe_time, fe_no)
##
## F test for time effects
##
## F = 2.2666, df1 = 8, df2 = 70, p-value = 0.03232
pFtest(fe_individuals, fe_no)
##
## F test for individual effects
##
## F = 0.3284, df1 = 8, df2 = 70, p-value = 0.9524
plmtest(fe_no, effect ="individual")
##
## Lagrange Multiplier Test - (Honda)
##
## normal = -1.7623, p-value = 0.961
#is this the right code
phtest(fe_time, re_no)
##
## Hausman Test
##
## chisq = 0.34732, df = 2, p-value = 0.8406
## alternative hypothesis: one model is inconsistent
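The Hausman test above compares a time fixed effects model with a random effects model fitted under plm's default
effect, which is individual. A minimal sketch, assuming the comparison is meant to hold the effect type fixed, refits
the random effects model with time effects and repeats the test, and also runs the Honda LM test for time effects.
The names `re_time`, `pooled_fit`, `ht_time`, and `lm_time` are introduced here and are not part of the original
analysis; no results are asserted.

```r
library(AER)  # provides the Municipalities data
library(plm)

data("Municipalities")
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
munic_pd <- pdata.frame(droplevels(subset(Municipalities, municipality %in% ids)),
                        index = c("municipality", "year"))

# Time fixed effects and time random effects on the same specification
fe_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "within", effect = "time")
re_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "random", effect = "time")

# Hausman test with matching effect types
ht_time <- phtest(fe_time, re_time)
print(ht_time)

# Honda LM test for time effects against the pooled model
pooled_fit <- plm(expenditures ~ revenues + grants, data = munic_pd,
                  model = "pooling")
lm_time <- plmtest(pooled_fit, effect = "time", type = "honda")
print(lm_time)
```

A small Hausman p-value would favor the fixed effects estimator; a large one would not reject the random effects
estimator.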
2. Including plotmeans (or scatterplot with appropriate inputs) diagrams to informally depict
individual/time/twoway heterogeneity.
plotmeans(expenditures ~ municipality, data = munic_pd)

[Figure: Plot of mean expenditures by municipality (n = 9 per group). x-axis: municipality (114, 115, 120, 123, 125, 126, 136, 138, 139); y-axis: expenditures]
plotmeans(expenditures ~ year, data = munic_pd)
[Figure: Plot of mean expenditures by year (n = 9 per group). x-axis: year (1979 to 1987); y-axis: expenditures]
3. For the pooled model, do the bptest and comment on the existence of Heteroskedasticity and correct
for it.
pooled <- plm(expenditures~revenues+grants,model="pooling",data=munic_pd)

library(stargazer)
bptest(pooled)
##
## studentized Breusch-Pagan test
##
## data: pooled
## BP = 1.493, df = 2, p-value = 0.474
#1. STATISTICAL DIAGNOSIS For the two-way effects F test (p = 0.2584) and the individual effects F test
(p = 0.9524), the p-values are greater than 0.05, so we fail to reject the null of no significant effects. For
the time effects F test (p = 0.03232), the p-value is less than 0.05, so we reject the null and conclude that
only the time effects are significant. This means there are differences across time but no systematic
differences across municipalities. The Honda LM test for individual effects (p = 0.961) fails to reject the
null of no individual random effects. The Hausman test (p = 0.8406) fails to reject the null that the random
effects estimator is consistent, so we find no evidence of endogeneity and prefer the Random Effects Model to
the Fixed Effects Model. Note that this test compares the time fixed effects model with the default
(individual) random effects model.
#2. HETEROGENEITY (WITHIN INDIVIDUALS) Through the plotmeans diagrams, the data appear to exhibit
heterogeneity across municipalities and across time. We selected the random effects model on the basis of the
Hausman test. This model assumes that the errors for the same municipality at different points in time are
correlated through a municipality-specific random error.
The LM test for heterogeneity, however, fails to reject the null, so it does not provide evidence of
individual random differences in the variance of the error term.
Given the significant time effects, the time fixed effects model is also a defensible choice.
#3. HETEROSKEDASTICITY Through the BP test (p = 0.474), we fail to reject the null of homoskedasticity and
conclude that there is insufficient evidence of heteroskedasticity. We therefore do not need cluster (panel)
robust standard errors, which account for heteroskedasticity and correlation within the same municipality
across time.
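The BP test did not call for a correction, but the question asks how heteroskedasticity would be corrected. A minimal
sketch of cluster-robust standard errors for the pooled model is given below, using plm's `vcovHC()` method together
with `coeftest()` from lmtest. The choices `method = "arellano"` and `type = "HC1"` are assumptions made for
illustration; `robust_tab` is a name introduced here.

```r
library(AER)     # provides the Municipalities data
library(plm)
library(lmtest)  # provides coeftest()

data("Municipalities")
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
munic_pd <- pdata.frame(droplevels(subset(Municipalities, municipality %in% ids)),
                        index = c("municipality", "year"))

pooled <- plm(expenditures ~ revenues + grants, data = munic_pd,
              model = "pooling")

# Standard errors clustered by municipality ("group")
robust_tab <- coeftest(pooled,
                       vcov. = vcovHC(pooled, method = "arellano",
                                      type = "HC1", cluster = "group"))
print(robust_tab)
```

The coefficient estimates are unchanged; only the standard errors, t statistics, and p-values differ from the
default output.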
2. Binary Dependent Variables
(a) Briefly discuss your data and the question you are trying to answer with your model. The relationship
you are trying to describe with your regression model should be clearly stated. Must include data
citation (where is it from?) What type of variables are in your model; categorical, indicator, continuous
and how many of each? What is the dependent variable?
library(AER)
data("SwissLabor")
which(is.na(SwissLabor))
## integer(0)
Do income (log of nonlabor income), age (years divided by 10), education, number of young children under 7,
number of older children over 7, and whether the individual is non-Swiss/foreign influence an individual's
participation in the Swiss labor force? The variables “participation” and “foreign” are indicator variables
(2), “youngkids” and “oldkids” are discrete counts (2), and the remaining three (“income”, “age”, and
“education”) are treated as continuous. “Participation” is our dependent variable.
Citation: Gerfin, M. (1996). Parametric and Semi-Parametric Estimation of the Binary Response Model of
Labour Market Participation. Journal of Applied Econometrics, 11, 321–339.
(b) Provide a descriptive analysis of your variables. This should include RELEVANT histograms and
fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-
number summary). All figures must include comments. For binary variables, you can simply include
the proportions of each factor. Ensure that all figures (or statistics) include relevant comments about
the nature of the data as indicated by the figure. (eg. whether data is skewed; approx normal; constant
across time/unit, etc.)
summary(SwissLabor$participation)
## no yes
## 471 401
summary(SwissLabor$income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   7.187  10.472  10.643  10.686  10.887  12.376
summary(SwissLabor$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.200   3.900   3.996   4.800   6.200
summary(SwissLabor$education)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   8.000   9.000   9.307  12.000  21.000
summary(SwissLabor$youngkids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3119  0.0000  3.0000
summary(SwissLabor$oldkids)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  1.0000  0.9828  2.0000  6.0000
summary(SwissLabor$foreign)
## no yes
## 656 216
library(corrplot)
library(ggplot2)
cornew <-SwissLabor[,-1]
cornewnew <-cornew[,-6]
correlation<- cor(cornewnew)
par(mfrow=c(1,1))
corrplot(correlation, method = "circle")
[Figure: Correlation plot (method = "circle") of income, age, education, youngkids, and oldkids. Color scale from -1 to 1]
corrplot(correlation, method = "number")
[Figure: Correlation plot (method = "number") of income, age, education, youngkids, and oldkids. Values shown:]
            income    age  education  youngkids  oldkids
income        1.00   0.01       0.33      -0.02     0.14
age           0.01   1.00      -0.15      -0.52    -0.12
education     0.33  -0.15       1.00       0.10    -0.04
youngkids    -0.02  -0.52       0.10       1.00    -0.24
oldkids       0.14  -0.12      -0.04      -0.24     1.00
# hist() has no data argument, so the column is passed directly
hist(SwissLabor$income, main = "Histogram of Income",
     xlab = "Logarithm of nonlabor income", freq = FALSE)
lines(density(SwissLabor$income), col = "blue", lwd = 3)
[Figure: Histogram of Income with density curve. x-axis: Logarithm of nonlabor income; y-axis: Density]
hist(SwissLabor$age, main = "Histogram of Age",
     xlab = "Age in decades (years divided by 10)", freq = FALSE)
[Figure: Histogram of Age. x-axis: Age in decades (years divided by 10); y-axis: Density]
hist(SwissLabor$education, main = "Histogram of Education",
     xlab = "Years of formal education", freq = FALSE)
[Figure: Histogram of Education. x-axis: Years of formal education; y-axis: Density]
hist(SwissLabor$youngkids, main = "Histogram of Young Kids",
     xlab = "Number of young children (under 7 years of age)",
     freq = FALSE)
[Figure: Histogram of Young Kids. x-axis: Number of young children (under 7 years of age); y-axis: Density]
hist(SwissLabor$oldkids, main = "Histogram of Old Kids",
     xlab = "Number of older children (over 7 years of age)",
     freq = FALSE)
[Figure: Histogram of Old Kids. x-axis: Number of older children (over 7 years of age); y-axis: Density]
boxplot(SwissLabor$income, main = "Boxplot of Log of Nonlabor Income")
[Figure: Boxplot of Log of Nonlabor Income. y-axis range: 7 to 12]
boxplot(SwissLabor$age, main = "Boxplot of Age in Decades")
[Figure: Boxplot of Age in Decades. y-axis range: 2 to 6]
boxplot(SwissLabor$education, main = "Boxplot of Years of Education")
[Figure: Boxplot of Years of Education. y-axis range: 5 to 20]
boxplot(SwissLabor$youngkids, main = "Boxplot of Number of Children Under 7 Years Old")
[Figure: Boxplot of Number of Children Under 7 Years Old. y-axis range: 0 to 3]
boxplot(SwissLabor$oldkids, main = "Boxplot of Number of Children Over 7 Years Old")
[Figure: Boxplot of Number of Children Over 7 Years Old. y-axis range: 0 to 6]
The correlation plot shows that no pairwise correlation is below -0.8 or above 0.8; the largest in magnitude
is -0.52, between age and youngkids. The regressors are therefore not highly correlated with each other. High
correlation between regressors (multicollinearity) could inflate standard errors and make the estimates
unreliable, so this is a good sign. For the participation variable, the group “no” (did not participate in the
labor force) is more prevalent, making up 471/872 or 54%, while the group “yes” accounts for 401/872 or 46%.
For the foreign variable, the “no” group is also more prevalent at 656/872 or 75%, whereas the “yes” group is
216/872 or 25%.
Since income is already in logarithms, its histogram appears approximately normal. The histogram of age
appears to be quite normally distributed as well. The histogram of education, upon visual inspection, may be
slightly left-skewed but is otherwise roughly normal. The histogram of young kids is interesting, as it only
takes the values 0, 1, 2, or 3 and is heavily right skewed, with most observations at zero. The histogram of
old kids is also right skewed.
The boxplot of the log of nonlabor income has several outliers both above and below the whiskers. The boxplot
of years of education has 2 outliers above the whiskers and 1 outlier below them. The boxplot for children
under 7 years of age is very flat and near zero, as most families have zero children that meet that criterion;
however, there are a few outliers. The boxplot for children over 7 years of age has an outlier at 6, and the
interquartile range is from 0 to 2, with a median of 1. Since our dependent variable is binary, a scatterplot
against it is not informative.
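The proportions and the largest correlation quoted above can be reproduced with a few lines of base R. The sketch
below assumes only the SwissLabor data from AER; the names `part_share`, `foreign_share`, `num_vars`, and `max_abs`
are introduced here for illustration.

```r
library(AER)  # provides the SwissLabor data
data("SwissLabor")

# Shares of each level of the two indicator variables
part_share <- prop.table(table(SwissLabor$participation))
foreign_share <- prop.table(table(SwissLabor$foreign))
print(round(part_share, 2))
print(round(foreign_share, 2))

# Distinct values taken by the number of young children
print(table(SwissLabor$youngkids))

# Largest absolute pairwise correlation among the numeric regressors
num_vars <- SwissLabor[, c("income", "age", "education", "youngkids", "oldkids")]
cors <- cor(num_vars)
max_abs <- max(abs(cors[upper.tri(cors)]))
print(round(max_abs, 2))
```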
(c) Fit the three models below, and identify which model is your preferred one and why. Make sure
to include statistical diagnostics to support your conclusion, and to comment on your findings. •
Linear Probability Model • Probit Model • Logit Model Based on the final model chosen, be sure to
include relevant marginal effects (eg. AME or MEM or at an important representative point). Using
a hypothetical threshold of 0.5, which model best describes given data using predictive performance?
You may use other metrics for comparisons such as likelihood ratios
library(margins)
swisslab.lpm <- lm(participation~income+age+education+youngkids+oldkids+foreign,

data = SwissLabor)
predict(swisslab.lpm, data.frame(income=10,age=5.0,education=8,youngkids=0,
oldkids=1,foreign="no"), type="response")
## 1
## 1.469198
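The fitted value of 1.469198 above lies outside the unit interval. One likely cause, stated here as an assumption
rather than something the output confirms, is that `participation` is a factor, so `lm()` works with its internal
codes (1 = no, 2 = yes) instead of a 0/1 outcome. A minimal sketch of a linear probability model on an explicit 0/1
response is given below; `part01`, `lpm01`, `new_obs`, and `p_hat` are names introduced here.

```r
library(AER)  # provides the SwissLabor data
data("SwissLabor")

# Hypothetical helper column: 1 if the individual participates, 0 otherwise
SwissLabor$part01 <- as.numeric(SwissLabor$participation == "yes")

lpm01 <- lm(part01 ~ income + age + education + youngkids + oldkids + foreign,
            data = SwissLabor)

# Same representative individual as in the prediction above
new_obs <- data.frame(income = 10, age = 5.0, education = 8, youngkids = 0,
                      oldkids = 1,
                      foreign = factor("no", levels = levels(SwissLabor$foreign)))
p_hat <- predict(lpm01, newdata = new_obs)
print(p_hat)

# With a numeric response, information criteria are defined for the LPM
print(AIC(lpm01))
print(BIC(lpm01))
```

Even with a 0/1 response, a linear probability model can return fitted values below 0 or above 1 for some
individuals, which is one reason to consider probit or logit.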
margins(swisslab.lpm)
##  income     age education youngkids  oldkids foreignyes
## -0.1679 -0.1062  0.007232   -0.2596 -0.00286     0.2848
swisslab.probit <- glm(participation~income+age+education+youngkids+oldkids+foreign,

data = SwissLabor, family = binomial(link = "probit"))
margins(swisslab.probit)
##  income     age education youngkids   oldkids foreignyes
## -0.1729 -0.1069   0.00702   -0.2699 -0.004638     0.2856
swisslab.logit <- glm(participation~income+age+education+youngkids+oldkids+foreign,

data = SwissLabor, family = binomial(link = "logit"))
margins(swisslab.logit)
##  income     age education youngkids   oldkids foreignyes
## -0.1699 -0.1064  0.006616   -0.2775 -0.004584     0.2834
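The values reported by `margins()` are average marginal effects (AME). For a logit model and a continuous regressor,
the AME is the coefficient multiplied by the sample average of the logistic density evaluated at each observation's
linear predictor. The sketch below computes this by hand as a cross-check; `logit_fit`, `eta`, `scale_factor`, and
`ame` are names introduced here, and the factor `foreign` is left out because its effect is a discrete difference
rather than a derivative.

```r
library(AER)  # provides the SwissLabor data
data("SwissLabor")

logit_fit <- glm(participation ~ income + age + education + youngkids +
                   oldkids + foreign,
                 data = SwissLabor, family = binomial(link = "logit"))

# Linear predictor for every observation
eta <- predict(logit_fit, type = "link")

# Average of the logistic density over the sample
scale_factor <- mean(dlogis(eta))

# AME = average density times coefficient, for the numeric regressors
num_terms <- c("income", "age", "education", "youngkids", "oldkids")
ame <- scale_factor * coef(logit_fit)[num_terms]
print(round(ame, 4))
```

The same calculation with `dnorm()` in place of `dlogis()` applies to the probit model.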
library(tidyverse)
AIC(swisslab.lpm)
## [1] NA
AIC(swisslab.logit)
## [1] 1066.798
AIC(swisslab.probit)
## [1] 1066.983
BIC(swisslab.lpm)
## [1] NA
BIC(swisslab.logit)
## [1] 1100.193
BIC(swisslab.probit)
## [1] 1100.378
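Based on AIC and BIC, the logit model is preferred: its AIC (1066.798) and BIC (1100.193) are slightly lower
than those of the probit model (1066.983 and 1100.378), although the difference is small. The LPM returns NA
for both criteria, so it cannot be compared on this basis.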
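The question also asks for a comparison of predictive performance at a threshold of 0.5. A minimal sketch is given
below: each model is refitted on an explicit 0/1 response, an observation is classified as a participant when its
fitted probability is at least 0.5, and the share of correct classifications is reported. The column `part01` and
the objects `fits` and `accuracy` are introduced here for illustration, and no results are asserted.

```r
library(AER)  # provides the SwissLabor data
data("SwissLabor")

# Hypothetical helper column: 1 if the individual participates, 0 otherwise
SwissLabor$part01 <- as.numeric(SwissLabor$participation == "yes")

spec <- part01 ~ income + age + education + youngkids + oldkids + foreign

fits <- list(
  lpm    = lm(spec, data = SwissLabor),
  probit = glm(spec, data = SwissLabor, family = binomial(link = "probit")),
  logit  = glm(spec, data = SwissLabor, family = binomial(link = "logit"))
)

# Share of observations classified correctly at the 0.5 threshold
accuracy <- sapply(fits, function(m) {
  predicted <- as.numeric(predict(m, type = "response") >= 0.5)
  mean(predicted == SwissLabor$part01)
})
print(round(accuracy, 4))

# Confusion matrix for the logit model (rows: actual, columns: predicted)
logit_pred <- as.numeric(predict(fits$logit, type = "response") >= 0.5)
print(table(actual = SwissLabor$part01, predicted = logit_pred))
```

The model with the highest share of correct classifications describes the data best by this metric; ties or near
ties would leave the choice to AIC, BIC, or interpretability.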