
Panel Datasets and Binary Dependent Variable Datasets
Andrew Flores, Jeremy Flores, Shannon Park, JunGyo Kim

2023-11-29
library(AER)
library(plm)
library(gplots)
library(lmtest)
data("Municipalities")

1. Panel Data Model

(a) Briefly discuss your data and the question you are trying to answer with your model.

is.pbalanced(Municipalities)

## [1] TRUE

which(is.na(Municipalities))

## integer(0)

Munic <- subset(Municipalities,
                municipality %in% c("114", "115", "120", "123", "125",
                                    "126", "136", "138", "139"))
munic_pd<-pdata.frame(Munic, index = c("municipality", "year"))
pdim(munic_pd)

## Balanced Panel: n = 9, T = 9, N = 81

We will use the Municipal Expenditure Data, a panel data set for 265 Swedish municipalities covering
9 years (1979-1987), with 2,385 observations on 5 variables. Question: What is the effect of revenues and
grants (in million SEK) on total expenditures, after accounting for differences across municipalities and
years? We have selected 9 municipalities, each observed in all 9 periods, so our panel is balanced
(n = 9, T = 9, N = 81): we have the same units over time and the same number of observations per unit.
Our dataset is a true panel (longitudinal) rather than a repeated cross-section, because the same
individuals (municipalities) are followed across several points in time.
Data Citation:
Dahlberg, M., and Johansson, E. (2000). An Examination of the Dynamic Behavior of Local Governments
Using GMM Bootstrapping Methods. Journal of Applied Econometrics, 15, 401–416.
Greene, W.H. (2003). Econometric Analysis, 5th edition. Upper Saddle River, NJ: Prentice Hall.

(b) Provide a descriptive analysis of your variables. This should include relevant figures with comments
including some graphical depiction of individual heterogeneity. Ensure that all figures (or statistics)
include relevant comments about the nature of the data as indicated by the figure (e.g., whether data
is skewed; approx. normal; constant across time/unit, etc.).

head(munic_pd)

##          municipality year expenditures  revenues    grants
## 114-1979          114 1979    0.0229736 0.0181770 0.0054429
## 114-1980          114 1980    0.0266307 0.0209142 0.0057304
## 114-1981          114 1981    0.0273253 0.0210836 0.0056647
## 114-1982          114 1982    0.0288704 0.0234310 0.0058859
## 114-1983          114 1983    0.0226474 0.0179979 0.0055908
## 114-1984          114 1984    0.0215601 0.0179949 0.0047536

# hist() has no data argument, so the column is passed directly
hist(Munic$expenditures, main = "Histogram of Expenditures",
     xlab = "Total Expenditures in million SEK", freq = FALSE)
lines(density(Munic$expenditures), col = "blue", lwd = 3)

[Figure: Histogram of Expenditures with density curve; x-axis: Total Expenditures in million SEK; y-axis: Density]

hist(Munic$revenues, main = "Histogram of Revenues",
     xlab = "Total Own-Source Revenues in million SEK", freq = FALSE)
lines(density(Munic$revenues), col = "blue", lwd = 3)

[Figure: Histogram of Revenues with density curve; x-axis: Total Own-Source Revenues in million SEK; y-axis: Density]

hist(Munic$grants, main = "Histogram of Grants",
     xlab = "Total Intergovernmental Grants received by Municipality in million SEK",
     freq = FALSE)
lines(density(Munic$grants), col = "blue", lwd = 3)

[Figure: Histogram of Grants with density curve; x-axis: Total Intergovernmental Grants received by Municipality in million SEK; y-axis: Density]

library(tseries)
library(forecast)
adf.test(munic_pd[,1])

##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 1]
## Dickey-Fuller = -7.9299, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

adf.test(munic_pd[,2])

##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 2]
## Dickey-Fuller = -7.9299, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

adf.test(munic_pd[,3])

##
## Augmented Dickey-Fuller Test

##
## data: munic_pd[, 3]
## Dickey-Fuller = -3.434, Lag order = 4, p-value = 0.05584
## alternative hypothesis: stationary

adf.test(munic_pd[,4])

##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 4]
## Dickey-Fuller = -3.45, Lag order = 4, p-value = 0.05324
## alternative hypothesis: stationary

adf.test(munic_pd[,5])

##
## Augmented Dickey-Fuller Test
##
## data: munic_pd[, 5]
## Dickey-Fuller = -3.4353, Lag order = 4, p-value = 0.05563
## alternative hypothesis: stationary

scatterplot(expenditures ~year|municipality, data= munic_pd)

[Figure: scatterplot of expenditures against year, grouped by municipality; observations 35 and 36 are labeled]

## [1] "35" "36"

scatterplot(expenditures ~municipality|year, data= munic_pd)


[Figure: scatterplot of expenditures against municipality, grouped by year; observations 67 and 76 are labeled]

## [1] "67" "76"

The histograms for expenditures and revenues appear right-skewed, and the histogram for grants appears
relatively normally distributed. Based on the Augmented Dickey-Fuller tests for expenditures (p = 0.05584),
revenues (p = 0.05324), and grants (p = 0.05563), we fail to reject the null of a unit root at the 5% level
for all three variables, although only narrowly (each would be rejected at the 10% level). The tests on
columns 1 and 2 are not informative because those columns hold the municipality and year identifiers.
Note also that adf.test treats each stacked panel column as a single time series, so these results are only
indicative. Non-stationarity would mean that moments such as the mean, variance, and covariance vary
with time, which may require differencing to correct. Heterogeneity across municipalities is seen in the
first and second scatterplots, shown by the differences in variance. The scatterplots show that the data
are not constant across time or across units.
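The differencing mentioned above can be sketched as follows. This is only an illustration, assuming the same nine municipalities as in part (a); the names ids and d_exp are introduced here and are not part of the original analysis. The point is that differencing a panel series should be done within each municipality rather than on the stacked column.

```r
library(AER)   # provides the Municipalities data
library(plm)

data("Municipalities")
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
munic_pd <- pdata.frame(subset(Municipalities, municipality %in% ids),
                        index = c("municipality", "year"))

# diff() on a panel series takes first differences within each municipality,
# so the first year of every unit becomes NA instead of borrowing the last
# value of the previous unit.
d_exp <- diff(munic_pd$expenditures)

# Number of missing values created by differencing
sum(is.na(d_exp))
```

With only 9 periods per municipality, differencing costs one observation per unit, so it should be weighed against the small sample.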

(c) Fit the three models below, and identify which model is your preferred one and why. Make sure to
include your statistical diagnostics to support your conclusion, and to comment on your findings. •
Pooled Model • Fixed Effects • Random Effects

#1. For statistical diagnostics, provide appropriate tests and their conclusions, and indicate whether the
tests are across individual, time, or both, depending on the fixed effect model used.

# Fixed effects (within) models with two-way, time-only, and individual-only effects
fe_twoways <- plm(expenditures ~ revenues + grants, data = munic_pd,
                  model = "within", effect = "twoways")
fe_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "within", effect = "time")
fe_individuals <- plm(expenditures ~ revenues + grants, data = munic_pd,
                      model = "within", effect = "individual")
# Pooled OLS and random effects (individual effects by default)
fe_no <- plm(expenditures ~ revenues + grants, data = munic_pd, model = "pooling")
re_no <- plm(expenditures ~ revenues + grants, data = munic_pd, model = "random")

pFtest(fe_twoways, fe_no)

##
## F test for twoways effects
##
## data: expenditures ~ revenues + grants
## F = 1.2498, df1 = 16, df2 = 62, p-value = 0.2584
## alternative hypothesis: significant effects

pFtest(fe_time, fe_no)

##
## F test for time effects
##
## data: expenditures ~ revenues + grants
## F = 2.2666, df1 = 8, df2 = 70, p-value = 0.03232
## alternative hypothesis: significant effects

pFtest(fe_individuals, fe_no)

##
## F test for individual effects
##
## data: expenditures ~ revenues + grants
## F = 0.3284, df1 = 8, df2 = 70, p-value = 0.9524
## alternative hypothesis: significant effects

plmtest(fe_no, effect ="individual")

##
## Lagrange Multiplier Test - (Honda)
##
## data: expenditures ~ revenues + grants
## normal = -1.7623, p-value = 0.961
## alternative hypothesis: significant effects

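# Note: a Hausman test normally compares fixed and random effects models with the same effect;
# here fe_time (time effects) is compared with re_no (individual effects), so read the result with caution.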

phtest(fe_time, re_no)

##
## Hausman Test
##
## data: expenditures ~ revenues + grants
## chisq = 0.34732, df = 2, p-value = 0.8406
## alternative hypothesis: one model is inconsistent

2. Include plotmeans (or scatterplot with appropriate inputs) diagrams to informally depict
individual/time/twoway heterogeneity.

plotmeans(expenditures ~ municipality, data = munic_pd)


[Figure: plotmeans of expenditures by municipality (114 to 139), n = 9 observations per municipality]

plotmeans(expenditures ~ year, data = munic_pd)

[Figure: plotmeans of expenditures by year (1979 to 1987), n = 9 observations per year]

3. For the pooled model, do the bptest and comment on the existence of Heteroskedasticity and correct
for it.

pooled <- plm(expenditures ~ revenues + grants, model = "pooling", data = munic_pd)
library(stargazer)
bptest(pooled)

##
## studentized Breusch-Pagan test
##
## data: pooled
## BP = 1.493, df = 2, p-value = 0.474

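#1. STATISTICAL DIAGNOSTICS The F tests compare each fixed effects model with the pooled model. For the
two-way effects test (p = 0.2584) and the individual effects test (p = 0.9524), the p-value is greater than
0.05, so we fail to reject the null of no significant effects. For the time effects test (p = 0.03232), the
p-value is less than 0.05, so we reject the null and conclude that only the time effects are significant.
This means the significant differences are across time, not across municipalities. The Honda LM test for
individual effects (p = 0.961) likewise fails to reject the null of no individual effects.
The Hausman test (p = 0.8406) fails to reject the null that the random effects estimator is consistent, so
we find no evidence of endogeneity, and the Random Effects Model is preferred over the fixed effects
model it is compared with.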
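As a cross-check, the Hausman comparison can be repeated with both models using the same effect. The sketch below is not part of the original analysis: it assumes the same nine municipalities, and re_time and ht are names introduced here. It pairs the time fixed effects model with a time random effects model, since the F tests point to time effects.

```r
library(AER)   # provides the Municipalities data
library(plm)

data("Municipalities")
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
munic_pd <- pdata.frame(subset(Municipalities, municipality %in% ids),
                        index = c("municipality", "year"))

# Fixed and random effects models that both use time effects
fe_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "within", effect = "time")
re_time <- plm(expenditures ~ revenues + grants, data = munic_pd,
               model = "random", effect = "time")

# Hausman test: the null is that the random effects estimator is consistent
ht <- phtest(fe_time, re_time)
ht
```

A large p-value would again favor random effects, while a small one would favor the time fixed effects model.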
#2. HETEROGENEITY Through the plotmeans, we can see that mean expenditures differ across municipalities
and across time, although the F and LM tests above find only the time differences to be statistically
significant. We selected the random effects model on the basis of the Hausman test. The usual justification
for random effects is a non-zero correlation between errors on the same individual at different points in
time, which implies an individual-specific random error.
However, in the LM test for heterogeneity we fail to reject the null, which means there is no significant
evidence of such individual random effects here.
The pooled model, or a fixed effects model with time effects only, would therefore also be defensible.
#3. HETEROSKEDASTICITY Through the BP test (p = 0.474), we fail to reject the null of homoskedasticity and
conclude that there is insufficient evidence of heteroskedasticity. We therefore do not need cluster (panel)
robust standard errors, which account for heteroskedasticity and for correlation within the same individual
across time.
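Had the BP test rejected the null, the correction could look like the sketch below. It is illustrative only, assuming the same nine municipalities; robust_vcov is a name introduced here. It uses the Arellano-type covariance estimator from plm, clustered by municipality, and passes it to coeftest from lmtest.

```r
library(AER)     # provides the Municipalities data
library(plm)
library(lmtest)

data("Municipalities")
ids <- c("114", "115", "120", "123", "125", "126", "136", "138", "139")
munic_pd <- pdata.frame(subset(Municipalities, municipality %in% ids),
                        index = c("municipality", "year"))

pooled <- plm(expenditures ~ revenues + grants, data = munic_pd, model = "pooling")

# Cluster-robust covariance matrix, clustered by municipality ("group")
robust_vcov <- vcovHC(pooled, method = "arellano", type = "HC0", cluster = "group")

# Coefficient table with robust standard errors
coeftest(pooled, vcov. = robust_vcov)
```

With only 9 clusters, cluster-robust standard errors can be unreliable, so the conventional standard errors remain a reasonable default here.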

2. Binary Dependent Variables

(a) Briefly discuss your data and the question you are trying to answer with your model. The relationship
you are trying to describe with your regression model should be clearly stated. Must include data
citation (where is it from?) What type of variables are in your model; categorical, indicator, continuous
and how many of each? What is the dependent variable?

library(AER)
data("SwissLabor")
which(is.na(SwissLabor))

## integer(0)

Do income (log of nonlabor income), age (years divided by 10), education, number of young children under 7,
number of older children over 7, and whether the individual is non-Swiss/foreign influence individual
participation in the Swiss labor force? The data contain 872 observations on 7 variables. “participation”
and “foreign” are indicator variables (2), “youngkids” and “oldkids” are discrete counts (2), and income,
age, and education are continuous (3). “participation” is our dependent variable.
Citation: Gerfin, M. (1996). Parametric and Semi-Parametric Estimation of the Binary Response Model of
Labour Market Participation. Journal of Applied Econometrics, 11, 321–339.

(b) Provide a descriptive analysis of your variables. This should include RELEVANT histograms and
fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-
number summary). All figures must include comments. For binary variables, you can simply include
the proportions of each factor. Ensure that all figures (or statistics) include relevant comments about
the nature of the data as indicated by the figure. (eg. whether data is skewed; approx normal; constant
across time/unit, etc.)

summary(SwissLabor$participation)

## no yes
## 471 401

summary(SwissLabor$income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   7.187  10.472  10.643  10.686  10.887  12.376

summary(SwissLabor$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.200   3.900   3.996   4.800   6.200

summary(SwissLabor$education)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   8.000   9.000   9.307  12.000  21.000

summary(SwissLabor$youngkids)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3119  0.0000  3.0000

summary(SwissLabor$oldkids)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  1.0000  0.9828  2.0000  6.0000

summary(SwissLabor$foreign)

## no yes
## 656 216

library(corrplot)
library(ggplot2)
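# Keep only the numeric columns (drop the factors participation and foreign)
num_vars <- SwissLabor[, c("income", "age", "education", "youngkids", "oldkids")]
correlation <- cor(num_vars)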
par(mfrow=c(1,1))
corrplot(correlation, method = "circle")

[Figure: correlation plot (method = "circle") of income, age, education, youngkids, and oldkids; color scale from -1 to 1]

corrplot(correlation, method = "number")

[Figure: correlation plot (method = "number"); the displayed coefficients are:]

            income    age  education  youngkids  oldkids
income        1.00   0.01       0.33      -0.02     0.14
age           0.01   1.00      -0.15      -0.52    -0.12
education     0.33  -0.15       1.00       0.10    -0.04
youngkids    -0.02  -0.52       0.10       1.00    -0.24
oldkids       0.14  -0.12      -0.04      -0.24     1.00

# hist() has no data argument, so the column is passed directly
hist(SwissLabor$income, main = "Histogram of Income",
     xlab = "Logarithm of nonlabor income", freq = FALSE)
lines(density(SwissLabor$income), col = "blue", lwd = 3)

[Figure: Histogram of Income with density curve; x-axis: Logarithm of nonlabor income; y-axis: Density]

hist(SwissLabor$age, main = "Histogram of Age",
     xlab = "Age in decades (years divided by 10)", freq = FALSE)
lines(density(SwissLabor$age), col = "blue", lwd = 3)

[Figure: Histogram of Age with density curve; x-axis: Age in decades (years divided by 10); y-axis: Density]

hist(SwissLabor$education, main = "Histogram of Education",
     xlab = "Years of formal education", freq = FALSE)
lines(density(SwissLabor$education), col = "blue", lwd = 3)

[Figure: Histogram of Education with density curve; x-axis: Years of formal education; y-axis: Density]

hist(SwissLabor$youngkids, main = "Histogram of Young Kids",
     xlab = "Number of young children (under 7 years of age)", freq = FALSE)
lines(density(SwissLabor$youngkids), col = "blue", lwd = 3)

[Figure: Histogram of Young Kids with density curve; x-axis: Number of young children (under 7 years of age); y-axis: Density]

hist(SwissLabor$oldkids, main = "Histogram of Old Kids",
     xlab = "Number of older children (over 7 years of age)", freq = FALSE)
lines(density(SwissLabor$oldkids), col = "blue", lwd = 3)

[Figure: Histogram of Old Kids with density curve; x-axis: Number of older children (over 7 years of age); y-axis: Density]

boxplot(SwissLabor$income, main = "Boxplot of Log of Nonlabor Income")

[Figure: Boxplot of Log of Nonlabor Income; axis range about 7 to 12]

boxplot(SwissLabor$age, main = "Boxplot of Age in Decades")

[Figure: Boxplot of Age in Decades; axis range 2 to 6]

boxplot(SwissLabor$education, main = "Boxplot of Years of Education")

[Figure: Boxplot of Years of Education; axis range about 5 to 20]

boxplot(SwissLabor$youngkids, main = "Boxplot of Number of Children Under 7 Years Old")

[Figure: Boxplot of Number of Children Under 7 Years Old; axis range 0 to 3]

boxplot(SwissLabor$oldkids, main = "Boxplot of Number of Children Over 7 Years Old")

[Figure: Boxplot of Number of Children Over 7 Years Old; axis range 0 to 6]

The correlation plot shows that no pairwise correlation is below -0.80 or above 0.80 (the largest in
magnitude is -0.52, between age and youngkids), which indicates that the variables are not highly
correlated with each other. High correlation between regressors causes multicollinearity, which inflates
standard errors and makes coefficients unstable, so this is a good sign. The proportions for the
participation variable show that the group “no” (did not participate in the labor force) is more prevalent,
making up 471/872 or 54%, while the group “yes” accounts for 401/872 or 46%. For the foreign variable, the
“no” group is also more prevalent and accounts for 656/872 or 75%, whereas the “yes” group is 216/872 or 25%.
Since income is already measured in logarithms, its histogram appears approximately normal. The histogram
of age appears to be roughly normally distributed as well. The histogram of education, upon visual
inspection, is roughly bell-shaped; its mean (9.31) slightly exceeds its median (9) and its maximum is 21,
which suggests slight right skew rather than left skew. The histogram of young kids is interesting, as it
takes only the values 0, 1, 2, or 3 and is heavily right-skewed, with most of the mass at 0. The histogram
of old kids is also right-skewed.
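The proportions quoted above can be computed directly rather than by hand. A minimal sketch, where part_share and foreign_share are names introduced here for illustration:

```r
library(AER)   # provides the SwissLabor data
data("SwissLabor")

# Share of each level of the two factor variables
part_share    <- prop.table(table(SwissLabor$participation))
foreign_share <- prop.table(table(SwissLabor$foreign))

# Percentages rounded to one decimal place
round(100 * part_share, 1)
round(100 * foreign_share, 1)
```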
The boxplot of the log of nonlabor income has several outliers both above and below the whiskers. The
boxplot of years of education has 2 outliers above the whiskers and 1 outlier below. The boxplot for
children under 7 years of age is very flat and sits at zero because most families have no children that
meet that criterion; however, there are a few outliers. The boxplot for children over 7 years of age has an
outlier at 6, and the interquartile range is from 0 to 2, with a median of 1. Since our dependent variable
is binary, a scatterplot against it is not informative.

(c) Fit the three models below, and identify which model is your preferred one and why. Make sure
to include statistical diagnostics to support your conclusion, and to comment on your findings. •
Linear Probability Model • Probit Model • Logit Model Based on the final model chosen, be sure to
include relevant marginal effects (eg. AME or MEM or at an important representative point). Using
a hypothetical threshold of 0.5, which model best describes given data using predictive performance?
You may use other metrics for comparisons such as likelihood ratios

library(margins)

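# Linear probability model. Note: participation is a factor, so lm() appears to
# fit on its internal codes (no = 1, yes = 2) rather than on 0/1.
swisslab.lpm <- lm(participation ~ income + age + education + youngkids + oldkids + foreign,
                   data = SwissLabor)

# Prediction for a representative individual. On the 1/2 scale, the value
# below would correspond to a fitted probability of about 0.469.
predict(swisslab.lpm, data.frame(income = 10, age = 5.0, education = 8, youngkids = 0,
                                 oldkids = 1, foreign = "no"), type = "response")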

## 1
## 1.469198

margins(swisslab.lpm)

##   income     age education youngkids  oldkids foreignyes
##  -0.1679 -0.1062  0.007232   -0.2596 -0.00286     0.2848

swisslab.probit <- glm(participation ~ income + age + education + youngkids + oldkids + foreign,
                       data = SwissLabor, family = binomial(link = "probit"))

margins(swisslab.probit)

##   income     age education youngkids   oldkids foreignyes
##  -0.1729 -0.1069   0.00702   -0.2699 -0.004638     0.2856

swisslab.logit <- glm(participation ~ income + age + education + youngkids + oldkids + foreign,
                      data = SwissLabor, family = binomial(link = "logit"))

margins(swisslab.logit)

##   income     age education youngkids   oldkids foreignyes
##  -0.1699 -0.1064  0.006616   -0.2775 -0.004584     0.2834

library(tidyverse)
AIC(swisslab.lpm)

## [1] NA

AIC(swisslab.logit)

## [1] 1066.798

AIC(swisslab.probit)

## [1] 1066.983

BIC(swisslab.lpm)

## [1] NA

BIC(swisslab.logit)

## [1] 1100.193

BIC(swisslab.probit)

## [1] 1100.378

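Based on the AIC and BIC (information criteria rather than tests, where lower is better), the logit model
is most suitable: AIC 1066.798 versus 1066.983 and BIC 1100.193 versus 1100.378 for the probit model. The
difference of about 0.19 is negligible, so the logit and probit models fit almost equally well. The LPM
returns NA for both criteria, so it cannot be compared on this basis.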
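The question also asks for predictive performance at a threshold of 0.5. One way to compare the three models is sketched below. This is not part of the original analysis: the 0/1 recoding (part01), the accuracy helper, and the object names are assumptions introduced here, with “yes” taken as the event of interest.

```r
library(AER)   # provides the SwissLabor data
data("SwissLabor")

# Recode the factor response to 0/1 so that lm() estimates a probability
SwissLabor$part01 <- as.numeric(SwissLabor$participation == "yes")
f <- part01 ~ income + age + education + youngkids + oldkids + foreign

lpm    <- lm(f, data = SwissLabor)
probit <- glm(f, data = SwissLabor, family = binomial(link = "probit"))
logit  <- glm(f, data = SwissLabor, family = binomial(link = "logit"))

# Share of observations classified correctly at a given threshold
accuracy <- function(p, y, threshold = 0.5) {
  mean((p > threshold) == (y == 1))
}

acc <- c(lpm    = accuracy(fitted(lpm), SwissLabor$part01),
         probit = accuracy(fitted(probit), SwissLabor$part01),
         logit  = accuracy(fitted(logit), SwissLabor$part01))
acc

# Confusion matrix for the logit model at the 0.5 threshold
table(predicted = fitted(logit) > 0.5, actual = SwissLabor$participation)
```

These are in-sample accuracies, so they flatter all three models; a held-out sample would give a fairer comparison.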
