Microeconometrics
Based on the textbooks
Verbeek: A Guide to Modern Econometrics
and Cameron and Trivedi: Microeconometrics
Robert M. Kunst
robert.kunst@univie.ac.at
University of Vienna
and
Institute for Advanced Studies Vienna
October 4, 2013
Microeconometrics University of Vienna and Institute for Advanced Studies Vienna
Basics Endogenous regressors Maximum likelihood Limited dependent variables Panel data
Outline
Basics
Ordinary least squares
The linear regression model
Goodness of fit
Restriction tests on coefficients
Asymptotic properties of OLS
Endogenous regressors
Maximum likelihood
Limited dependent variables
Binary choice models
Multiresponse models
Panel data
What is micro-econometrics?
Micro-econometrics is concerned with the statistical analysis of
individual (not aggregated) economic data. Some aspects of such
data call for special emphasis.

Ordinary least squares
The OLS objective

The vector minimizing

S(β) = ∑_{i=1}^N (y_i − x_i′β)²

is the OLS (ordinary least squares) vector b = (b_1, …, b_K)′.
Notation: x_{i1} ≡ 1, but x_1 = (1, x_{12}, …, x_{1K})′.
Ordinary least squares
The normal equations of OLS
To minimize S(β), set its derivative to zero:

∑_{i=1}^N x_i (y_i − x_i′b) = 0  ⟺  ( ∑_{i=1}^N x_i x_i′ ) b = ∑_{i=1}^N x_i y_i,

or simply, in matrix and vector notation, using y = (y_1, …, y_N)′
and X = (x_1, …, x_N)′,

b = (X′X)⁻¹X′y.

Notation: Here, x_i = (1, x_{i2}, …, x_{iK})′.
Ordinary least squares
ŷ and residuals

The (systematic or predicted) value

ŷ_i = x_i′b

is the best linear approximation of y from x_2, …, x_K, the best
approximation by linear combinations. The difference between y
and ŷ is called the residual

e_i = y_i − ŷ_i = y_i − x_i′b.
Ordinary least squares
Residual sum of squares
The function S(·) evaluated at its minimum

S(b) = ∑_{i=1}^N (y_i − ŷ_i)² = ∑_{i=1}^N e_i²

is the residual sum of squares (RSS). Because of the normal
equations,

∑_{i=1}^N x_i (y_i − x_i′b) = ∑_{i=1}^N x_i e_i = 0,

such that residuals and regressors are orthogonal. Also,

∑_{i=1}^N e_i = 0  ⟹  ∑_{i=1}^N (y_i − x_i′b) = 0  ⟹  ȳ = x̄′b,

so the fitted regression passes through the point of sample means.
For a simple regression on one regressor, this yields

b_2 = ∑_{i=1}^N (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^N (x_i − x̄)²

for the slope.
Ordinary least squares
A dummy regressor
If x_i = 1 for some observations and x_i = 0 otherwise (man/woman,
employed/unemployed, before/after 1989), x is called a dummy
variable. x̄ will be the share of 1s in the sample.
b_1 is the average y for the 0-individuals, ȳ_[0], and b_2 is the
difference between ȳ_[1] and ȳ_[0].
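As a quick numerical illustration (a minimal sketch with made-up data), regressing y on an intercept plus a dummy recovers exactly the group means described above:

```python
import numpy as np

# Made-up sample: a 0/1 dummy regressor d and outcomes y
d = np.array([0, 0, 0, 1, 1, 1, 1])
y = np.array([2.0, 3.0, 4.0, 7.0, 8.0, 9.0, 8.0])

X = np.column_stack([np.ones_like(d, dtype=float), d])  # intercept + dummy
b, *_ = np.linalg.lstsq(X, y, rcond=None)

mean0 = y[d == 0].mean()   # average y of the 0-individuals
mean1 = y[d == 1].mean()   # average y of the 1-individuals
# b[0] equals the group-0 mean, b[1] the difference of group means
print(b[0], b[1])
```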
Ordinary least squares
Linear regression and matrix notation
Most issues are simpler and more compact in matrix notation:
X =
⎛ 1  x_12  …  x_1K ⎞     ⎛ x_1′ ⎞        ⎛ y_1 ⎞
⎜ ⋮    ⋮         ⋮ ⎟  =  ⎜  ⋮   ⎟ ,  y = ⎜  ⋮  ⎟ .
⎝ 1  x_N2  …  x_NK ⎠     ⎝ x_N′ ⎠        ⎝ y_N ⎠

The sum of squares to be minimized is now

S(β) = (y − Xβ)′(y − Xβ) = y′y − 2y′Xβ + β′X′Xβ,

and ∂S(β)/∂β = 0 yields immediately b = (X′X)⁻¹X′y.
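The closed form b = (X′X)⁻¹X′y can be checked numerically. This is a sketch with simulated data (all numbers are illustrative):

```python
import numpy as np

# Simulated regression data (illustrative values)
rng = np.random.default_rng(42)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=N)

# Normal equations: (X'X) b = X'y; solve rather than invert explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the generic least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the normal equations is preferable to forming (X′X)⁻¹ explicitly, since an explicit inverse is slower and less numerically stable.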
Ordinary least squares
The projection matrix
The vector of observations is the sum of its systematic part and
the residuals

y = Xb + e = X(X′X)⁻¹X′y + e = ŷ + e.

The matrix P_X that transforms y into its systematic part

ŷ = X(X′X)⁻¹X′y = P_X y

is called the projection matrix. It is singular, N × N of rank K,
and P_X P_X = P_X. Note the orthogonality

P_X (I − P_X) = 0

to the matrix M_X = I − P_X that transforms y into residuals e.
The linear regression model
Linear regression in a statistical model
The observations are seen as generated by y_i = x_i′β + ε_i. It is
assumed that X′X is non-singular (no multicollinearity), that the
errors ε_i are random variables independent of the regressors, and
also that E[ε_i] = 0. With non-stochastic regressors, this
exogeneity assumption is unnecessary. (There are also other
definitions of exogeneity.)
The linear regression model
Estimator and estimate
An estimate is a statistic obtained from a sample that is designed
to approximate an unknown parameter closely: b is an estimate for
β. The unknown but fixed β is called a K-dimensional parameter
or a K-vector of scalar parameters (real numbers). In regression, b
may be called the coefficient estimates, β may be called the
coefficients.
An estimator is a rule that says how to obtain the estimate from
data. OLS is an estimator.
The residual e_i approximates the true error term ε_i, but it is not
an estimate, as ε_i is not a parameter. Don't be sloppy: never
confuse errors and residuals.
The linear regression model
Gauss-Markov conditions
The Gauss-Markov conditions imply nice properties for the OLS
estimator. In particular, for the linear regression model
y_i = x_i′β + ε_i, assume

A1  E(ε_i) = 0, i = 1, …, N;
A2  {ε_1, …, ε_N} and {x_1, …, x_N} are independent;
A3  V(ε_i) = σ², i = 1, …, N;
A4  cov(ε_i, ε_j) = 0, i, j = 1, …, N, i ≠ j.

(A1) identifies the intercept, (A2) is an exogeneity assumption,
(A3) is called the homoskedasticity assumption, (A4) assumes
the absence of autocorrelation.
The linear regression model
Some implications of the Gauss-Markov assumptions
(A1), (A3), (A4) imply for the vector random variable ε that
E(ε) = 0 and V(ε) = σ²I_N, a scalar matrix.
(A1) and (A2) imply that OLS is unbiased:

E(b) = E{(X′X)⁻¹X′y} = E{(X′X)⁻¹X′(Xβ + ε)}
     = β + E{(X′X)⁻¹X′ε} = β.
The linear regression model
Variance of OLS
Under assumptions (A1)–(A4), the variance of the OLS estimator
follows the simple formula

V(b|X) = σ²(X′X)⁻¹,

because of

V(b|X) = E{(b − β)(b − β)′|X}
       = E{(X′X)⁻¹X′εε′X(X′X)⁻¹|X}
       = (X′X)⁻¹X′E(εε′|X)X(X′X)⁻¹
       = σ²(X′X)⁻¹.
The linear regression model
The Gauss-Markov Theorem
Theorem
Under the assumptions (A1)–(A4), the OLS estimator is the best
linear unbiased estimator, i.e. the estimator among the linear
unbiased estimators with the smallest variance.

Here, "best" means that V(b̃) − V(b) is non-negative definite for
any other linear unbiased estimator b̃.
The variance σ²(X′X)⁻¹ is estimated by plugging in an
estimate for σ² = E(ε_i²) that uses residuals:

σ̂² = 1/(N − 1) ∑_{i=1}^N e_i²,    s² = 1/(N − K) ∑_{i=1}^N e_i²,

with s² often preferred as it is unbiased. Note that
Es² = σ², but Es ≠ σ.
The square roots of the variances in the diagonal of s²(X′X)⁻¹
are the standard errors of the coefficients b_k.
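A sketch of computing s² and the standard errors in code, on simulated data with true error variance σ² = 1 (all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([0.5, 1.0, -2.0])
y = X @ beta + rng.normal(size=N)          # true sigma^2 = 1

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (N - K)                       # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
```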
Goodness of fit
The R²

The most customary goodness-of-fit measure is the R² defined by

R² = ∑_{i=1}^N (ŷ_i − ȳ)² / ∑_{i=1}^N (y_i − ȳ)² = V̂(ŷ)/V̂(y),

with V̂ denoting empirical variance and ȳ the sample mean.
Orthogonality between ŷ and e implies that

V̂(y) = V̂(ŷ) + V̂(e),

such that

R² = 1 − V̂(e)/V̂(y) = 1 − ∑_{i=1}^N e_i² / ∑_{i=1}^N (y_i − ȳ)².
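When an intercept is included, the two expressions for R² coincide, which a short simulation confirms (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b
e = y - yhat

tss = ((y - y.mean()) ** 2).sum()
r2_explained = ((yhat - y.mean()) ** 2).sum() / tss  # V(yhat)/V(y)
r2_residual = 1.0 - (e ** 2).sum() / tss             # 1 - V(e)/V(y)
```

Without an intercept, the mean of ŷ need not equal ȳ, and the two versions can diverge; this is one reason the uncentered variant exists.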
Goodness of fit
Variants of R²

If a regression is run without an intercept (homogeneous
regression), the uncentered R²

R²_0 = 1 − ∑_{i=1}^N e_i² / ∑_{i=1}^N y_i²

becomes attractive. Comparison with the usual R² becomes
difficult.
If R² is interpreted as an estimate of the squared correlation
coefficient of y and ŷ, it is severely biased. The adjusted R²

R̄² = 1 − [1/(N − K) ∑_{i=1}^N e_i²] / [1/(N − 1) ∑_{i=1}^N (y_i − ȳ)²]

has less bias. It is not a panacea, however, for the problem that R²
cannot be used for model selection.
Restriction tests on coefficients
The t-statistic

Additional to (A1)–(A4), assume

A5  The errors ε_i follow a normal distribution.

Then, it follows that the vector ε ∼ N(0, σ²I_N). Clearly,
y|X ∼ N(Xβ, σ²I_N), and also the OLS coefficient b follows a
normal distribution. It can be shown that the ratio

t_k = (b_k − β_k) / (s √c_kk)

is t-distributed with N − K degrees of freedom. Here, c_kk denotes
the k-th diagonal element of (X′X)⁻¹.
Restriction tests on coefficients
The t-test

The null hypothesis of interest is H_0: β_k = β⁰_k. Then, under the
null hypothesis, the t-statistic

t_k = (b_k − β⁰_k) / (s √c_kk)

is t_{N−K} distributed. As N − K increases, t_{N−K} approaches the
normal N(0, 1) distribution. It is customary to say that a
coefficient is significant whenever its |t_k| for H_0: β_k = 0 is larger
than 1.96 or even 2.
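In code, the t-test for H_0: β_k = 0 looks as follows (a sketch with simulated data; scipy's t distribution supplies the two-sided p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
N, K = 100, 2
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 0.8 * x + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (N - K)
c = np.diag(np.linalg.inv(X.T @ X))        # c_kk
t = b / np.sqrt(s2 * c)                    # t_k for H0: beta_k = 0
p = 2 * stats.t.sf(np.abs(t), df=N - K)    # two-sided p-values
```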
Restriction tests on coefficients
The F-test

Assume the null hypothesis of interest is

H_0: β_{K−J+1} = … = β_K = 0,

i.e. J restrictions on the coefficients. Let S_0 denote the residual
sum of squares if OLS is applied to the restricted model with
K − J regressors, and S_1 the RSS for the unrestricted model with
K regressors. One can show that the F-statistic

F = [(S_0 − S_1)/J] / [S_1/(N − K)]

is under H_0 distributed F(J, N − K). As N gets large, the F-test
becomes essentially a χ²(J) test.
Restriction tests on coefficients
Variants of the F-test

A special case is the test of H_0: β_2 = … = β_K = 0, for which
S_0 = ∑_{i=1}^N (y_i − ȳ)². This is the only F shown in
a standard regression printout.
Rejecting H_0 although it is correct is a type I error;
not rejecting H_0 although it is incorrect is a type II error.

Asymptotic properties of OLS
OLS consistency

Additional to (A1)–(A4), assume

A6  plim N⁻¹X′X = Σ_X, a finite non-singular matrix,

which is stronger than no multicollinearity. It excludes asymptotic
multicollinearity as well as increasing regressors.

Theorem
Under the assumptions (A1)–(A4), (A6), it holds that plim b = β.

This property is also called in short "b is consistent for β".
Formally, it is defined by

lim_{N→∞} P{|b_k − β_k| > δ} = 0  ∀δ > 0, ∀k.

The proof uses the Chebyshev inequality.
Asymptotic properties of OLS
Remarks on OLS consistency
Write the OLS estimator as

b = (N⁻¹X′X)⁻¹ N⁻¹X′y = β + (N⁻¹X′X)⁻¹ N⁻¹X′ε,

such that b converges to β if the last term converges to 0.
The factor N⁻¹X′ε converges to 0, and this will hold true under
conditions weaker than the Gauss-Markov conditions, admitting
some correlation among the errors and some heteroskedasticity.
Asymptotic properties of OLS
Asymptotic normality of OLS
Theorem
Under assumptions (A1)–(A4), (A6), it holds that

√N (b − β) → N(0, σ²Σ_X⁻¹).

Again, for this result conditions weaker than the Gauss-Markov
conditions would suffice. Note that the normality assumption (A5)
is not needed, according to the Central Limit Theorem. The
theorem implies that b is asymptotically distributed as
N(β, σ²(X′X)⁻¹). Similarly, the corresponding t-statistics will be
asymptotically N(0, 1) distributed etc.
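A Monte Carlo sketch of the theorem: with deliberately non-normal (uniform) errors, the spread of √N(b_k − β_k) stays close to the theoretical value as N grows. All settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(11)
beta = np.array([1.0, 2.0])

def sd_scaled_dev(N, reps=500):
    """Std. dev. of sqrt(N)(b_2 - beta_2) across simulated samples."""
    devs = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=N)
        X = np.column_stack([np.ones(N), x])
        eps = rng.uniform(-1.0, 1.0, size=N)   # non-normal errors
        y = X @ beta + eps
        b = np.linalg.solve(X.T @ X, X.T @ y)
        devs[r] = np.sqrt(N) * (b[1] - beta[1])
    return devs.std()

# Theory: the limit std. dev. for the slope is sigma = sqrt(1/3) ~ 0.577
# (the std. dev. of U(-1,1) errors), since Var(x) = 1 here.
s100, s400 = sd_scaled_dev(100), sd_scaled_dev(400)
```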
Binary choice models
Binary choice models
Linear regression only works as long as it is reasonable to view the
dependent variable as distributed Gaussian or similar (unimodal,
continuous, support is ℝ). It fails when the dependent variable is
binary (0/1, buys a laptop / does not buy, employed/unemployed,
etc.).
Two suggestions:
P(y_i = 1|x_i) = G(x_i, β), with known G a link function with its
image in [0, 1];
y_i* = x_i′β + u_i for the latent (hidden, unobserved) variable y*,
and y = 1 if y* > 0.
Common choices for the link F (with P(y_i = 1|x_i) = F(x_i′β)):

F(w) = Φ(w), the standard normal cdf, defines the probit model;

F(w) = L(w) = e^w / (1 + e^w), the logistic cdf, defines the logit
model;

F(w) = 0 for w ≤ 0,  w for 0 ≤ w ≤ 1,  1 for w > 1
defines the linear probability model.
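The three choices of F can be written as small Python functions (a sketch; the probit line uses scipy's normal cdf):

```python
import numpy as np
from scipy.stats import norm

def probit_F(w):
    """Standard normal cdf (probit)."""
    return norm.cdf(w)

def logit_F(w):
    """Logistic cdf (logit)."""
    return np.exp(w) / (1.0 + np.exp(w))

def lpm_F(w):
    """Linear probability model: clamp w to [0, 1]."""
    return np.clip(w, 0.0, 1.0)
```

All three map the index into [0, 1]; probit and logit agree that w = 0 yields probability 0.5, the point of undecidedness.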
Binary choice models
Marginal effects

In linear regression, the β_k reflect the marginal effect of a change
in x_k. In binary choice models, this issue is more complex, as the
marginal effects (φ denotes the standard normal density)

∂Φ(x_i′β)/∂x_ik = φ(x_i′β) β_k ;
∂L(x_i′β)/∂x_ik = [exp(x_i′β) / {1 + exp(x_i′β)}²] β_k ,

depend on the values of the covariates. They are often computed
at sample averages.
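Computing logit marginal effects at the sample averages, as described above (the coefficients and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(13)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
beta = np.array([0.2, 1.0, -0.5])          # hypothetical coefficients

xbar = X.mean(axis=0)                      # sample averages of covariates
z = xbar @ beta
lam = np.exp(z) / (1.0 + np.exp(z)) ** 2   # logistic density at xbar'beta
me = lam * beta                            # marginal effects at the means
```

Since the logistic density is at most 1/4, each marginal effect is at most β_k/4 in absolute value; the effects keep the sign of the coefficients but not their scale.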
Binary choice models
The odds ratio
For p_i = P(y_i = 1|x_i), the term

log{p_i / (1 − p_i)} = x_i′β

is the log odds ratio. At the point of undecidedness, the log odds
ratio is 0. In the logit model, it is a linear function of the
covariates. β_k is the marginal reaction of the log odds ratio to a
change in x_k.
Binary choice models
Underlying latent model
Consider the second interpretation

y_i* = x_i′β + ε_i

with y_i = 1 if y_i* > 0 and y_i = 0 otherwise. y* is an unobserved
latent variable (imagine utility etc.). Note

P(y_i = 1) = P(y_i* > 0) = P(ε_i > −x_i′β) = F(x_i′β),

with F the distribution function of the errors ε. Normal (logistic)
distribution for ε implies the probit (logit) model.
Binary choice models
The likelihood of a binary choice model
The standard logit/probit model is fully parametric. The likelihood
is

L(β; y, X) = ∏_{i=1}^N P(y_i = 1|x_i; β)^{y_i} P(y_i = 0|x_i; β)^{1−y_i},

which yields for the log-likelihood

log L(β; y, X) = ∑_{i=1}^N y_i log F(x_i′β) + ∑_{i=1}^N (1 − y_i) log(1 − F(x_i′β)).
Binary choice models
The maximum-likelihood estimator
Maximizing the likelihood L in β yields the ML estimator β̂. There
is no closed form. Because of continuity, the ML estimator can be
obtained numerically by solving

∂ log L(β)/∂β = ∑_{i=1}^N [ (y_i − F(x_i′β)) / {F(x_i′β)(1 − F(x_i′β))} ] f(x_i′β) x_i = 0,

with f = F′, the density corresponding to F.
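A sketch of solving this first-order condition numerically for the logit model, where f = F(1 − F) makes the score simplify to ∑(y_i − F(x_i′β))x_i. The Newton-Raphson loop below uses simulated data (all values are illustrative):

```python
import numpy as np

# Simulated binary-choice data (illustrative true coefficients)
rng = np.random.default_rng(21)
N = 1000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
beta_true = np.array([0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = (rng.uniform(size=N) < p).astype(float)

b = np.zeros(2)                          # start at beta = 0
for _ in range(25):                      # Newton-Raphson iterations
    F = 1.0 / (1.0 + np.exp(-X @ b))
    score = X.T @ (y - F)                # gradient of log L (logit case)
    W = F * (1.0 - F)                    # logistic density f = F(1 - F)
    H = -(X * W[:, None]).T @ X          # Hessian of log L
    b = b - np.linalg.solve(H, score)
```

At convergence the score is numerically zero, and b approximates the true coefficients up to sampling error.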