Lecture 5 Regression
ET2013 Introduction to Econometrics
Giacomo Pasini
Lecture overview
1 Basics of Regression
2 Hypothesis testing
3 Interpreting results
4 Model specification
5 Robust standard errors
6 Penalized regression
Basics of Regression
Univariate regression
Regression is the most common way of estimating the relationship between two variables while controlling for others, allowing you to close back doors with those controls.
OLS fits a function that is linear in the parameters in order to explain Y with X:
Y = β₀ + β₁X
OLS estimates β̂₀ and β̂₁ that minimize the sum of squared residuals
We can interpret β₁ as a slope. So, a one-unit increase in X is associated with a β₁ increase in Y.
With only one regressor, β̂₁ = cov(X, Y) / var(X)
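A minimal sketch of this formula (simulated data; numpy assumed): the slope is the covariance-variance ratio, and the intercept follows from the means.

```python
# Sketch: with one regressor, the OLS slope equals cov(X, Y) / var(X).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = 2.0 + 1.5 * X + rng.normal(size=500)   # true beta0 = 2, beta1 = 1.5

beta1_hat = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # slope from cov/var
beta0_hat = Y.mean() - beta1_hat * X.mean()         # intercept from the means
print(beta0_hat, beta1_hat)                         # close to 2 and 1.5
```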
Multivariate regression

We estimate
Y = β₀ + β₁X + β₂Z + ε
The error term is the difference between Y and the true best-fit line in the population:
ε = Y − (β₀ + β₁X + β₂Z)
The residual is the difference between Y and the fitted values from the estimated model:
ε̂ = Y − Ŷ = Y − (β̂₀ + β̂₁X + β̂₂Z)
Univariate regression Y = β₀ + β₁X + ε:

β̂₁ ∼ N( β₁ , σ² / (n·var(X)) )
So how can we make an OLS estimate’s sampling variation small? (All three levers appear in the simulation sketch below.)
1 We could shrink the standard deviation of the error term σ, i.e., make the model predict Y more accurately
2 We could pick an X that varies a lot: more variation in X makes it easier to see whether Y changes with it
3 We could use a big sample, so that n gets big
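A simulation sketch (all numbers are arbitrary): drawing many samples and re-estimating β̂₁ each time recovers the theoretical standard error σ/√(n·var(X)).

```python
# Simulation sketch: the spread of beta1_hat across repeated samples
# shrinks with a smaller error sd, a more variable X, or a larger n.
import numpy as np

def beta1_hat(n, sigma, x_sd, rng):
    X = rng.normal(scale=x_sd, size=n)
    Y = 1.0 + 0.5 * X + rng.normal(scale=sigma, size=n)
    return np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

rng = np.random.default_rng(0)
draws = np.array([beta1_hat(n=200, sigma=1.0, x_sd=1.0, rng=rng)
                  for _ in range(2000)])
print(draws.std())                    # simulated standard error of beta1_hat
print(1.0 / np.sqrt(200 * 1.0**2))    # theory: sigma / sqrt(n * var(X))
```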
Multivariate regression Y = β₀ + β₁X + β₂Z + ε:

(β̂₁, β̂₂)′ ∼ N( (β₁, β₂)′ , (σ²/n)(W′W)⁻¹ )

W contains both X and Z, so we divide by the variances AND covariances of X and Z
The standard deviations of β̂₁ and β̂₂ are the square roots of the diagonal elements of (σ²/n)(W′W)⁻¹
Jargon: std deviation of a sampling distribution is often referred to as standard error
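A sketch of the same computation in code (simulated data): the slide’s (σ²/n)(W′W)⁻¹, with W′W normalized by n, is the same object as the textbook form σ̂²(W′W)⁻¹ used below.

```python
# Sketch: standard errors as square roots of the diagonal of the
# coefficient covariance matrix, with W the design matrix (constant, X, Z).
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=n)
Z = 0.5 * X + rng.normal(size=n)          # X and Z correlated on purpose
Y = 1.0 + 2.0 * X - 1.0 * Z + rng.normal(size=n)

W = np.column_stack([np.ones(n), X, Z])
beta_hat = np.linalg.solve(W.T @ W, W.T @ Y)
resid = Y - W @ beta_hat
sigma2_hat = resid @ resid / (n - W.shape[1])   # dof-adjusted error variance
cov_beta = sigma2_hat * np.linalg.inv(W.T @ W)
print(np.sqrt(np.diag(cov_beta)))               # se(b0), se(b1), se(b2)
```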
Hypothesis testing
Why did we want to know the OLS coefficient distribution? So that we can use what we observe to conclude that certain theoretical distributions are unlikely.
Thus, given the assumptions we’ve made, we can use our estimate β̂₁ to say that certain population parameters β₁ are very unlikely
Example: “it’s unlikely that the effect of X on Y is 13.2”, or (MUCH) more often, “it’s unlikely that the effect of X on Y is 0”
In the previous example, at the 95% significance level (α = 0.05) we accept the null even if it is wrong (type II error, false negative)
False positives, type I error: the null is rejected even if it’s true

                       Test result
  True value is    accept            reject
  true             ✓                 type I error
  false            type II error     ✓

With a lower α (0.05 < 0.10), i.e. a higher significance level (95% > 90%):
wider confidence interval for the null, i.e. wider acceptance region
higher probability of wrongly failing to reject the null hypothesis (type II error)
lower probability of wrongly rejecting the null hypothesis (type I error)
Type I and Type II error cannot be minimized simultaneously!
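As a sketch of the mechanics (the estimate, standard error, and degrees of freedom below are purely illustrative, not from a real regression): a t-statistic and two-sided p-value for H₀: β₁ = 0.

```python
# Sketch: a t-test of H0: beta1 = 0 from an estimate and its standard error.
from scipy import stats

beta1_hat, se, dof = -0.019, 0.004, 27000   # illustrative numbers only
t_stat = (beta1_hat - 0.0) / se             # distance from the null, in se units
p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
print(t_stat, p_value)                      # reject H0 if p_value < alpha
```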
Interpreting results
Regression tables
Coefficient estimates
In parentheses, the standard error of the coefficient (or, less frequently, the t-statistic)
Significance stars
Number of observations
Goodness of fit measures
Significance stars
These let you know at a glance whether the coefficient is statistically significantly
different from a null-hypothesis value of 0.
They’re a representation of the p-value (the probability of being as far away from the null-hypothesis value as our estimate actually is, or farther)
If the p-value is below α, that’s statistical significance (we reject the null of the coefficient being equal to 0).
The coefficient of Number of Locations has ***: that means that if we had decided that α = 0.01 (or higher), we would reject the null of β₁ = 0
Measures of the share of the dependent variable’s variance that is predicted by the model
First column: R² = 0.065, so 6.5% of the variation in Inspection Score is predicted by the Number of Locations.
If we were to predict Inspection Score with Number of Locations, we’d be left with a residual variable that has 1 − 0.065 = 0.935, i.e. 93.5% of the variance of Inspection Score.
Problem: adding any variable to a model always makes the R 2 go up by some small
amount, even if the variable doesn’t make any sense.
Adjusted R 2 : it only counts the variance explained above and beyond what you’d
get by just adding a random variable to the model.
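A minimal sketch of both measures (the helper name r2_stats is ours):

```python
# Sketch: R2 and adjusted R2 from residuals; the adjustment penalizes
# each extra regressor.
import numpy as np

def r2_stats(Y, Y_hat, k):
    """R2 and adjusted R2; k = number of estimated coefficients (incl. constant)."""
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean()) ** 2)
    n = len(Y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)   # penalizes extra regressors
    return r2, adj_r2
```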
We have the “F-statistic”: the test statistic for the null that all the coefficients in the model (except the intercept/constant) are zero at once. This is pretty useless...
RMSE: estimate of the standard deviation of the error term.
1 Take predicted Inspection Score based on OLS estimates and subtract from the actual
values to get a residual.
2 Calculate the standard deviation of that residual
3 Make a slight adjustment for the “degrees of freedom” (number of observations in the
data minus number of coefficients in the model).
4 If RMSE is big, the average errors in prediction for the model are big.
Several other measures (AIC, BIC...)
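A sketch of steps 1-3 above in code (the helper name rmse is ours):

```python
# Sketch: residuals, then a degrees-of-freedom-adjusted standard deviation.
import numpy as np

def rmse(Y, Y_hat, n_coef):
    resid = Y - Y_hat                          # step 1: residuals
    dof = len(Y) - n_coef                      # step 3: n minus coefficients
    return np.sqrt(np.sum(resid ** 2) / dof)   # step 2 with the adjustment
```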
These are measures of how well your dependent variable is predicted by your OLS
model
If you are after the causal effect of one variable on another, you are interested in estimating specific coefficients well
If you are confident you’ve included the variables in your model necessary to identify
your treatment (i.e., you closed all back doors), then R 2 is of little importance.
Take home message: do NOT fixate on R 2 , you do not have to maximize it!
If, based on what we learned in the first part of the book, we think we’ve identified the causal effect of number of locations on inspection score, we can word it accordingly:
“a one-unit increase in number of locations decreases inspection score by 0.019.”
If you’re not sure you identified a causal relation, then regression coefficients are
partial correlations and measure associations
Model specification
Binary variables
Binary variables (also called dummy variables) are common in social science: are
you a man or a woman? Are you Catholic or not? Are you married or not?
Especially important for causal analysis: the causes we tend to be interested in are
binary in nature. Did you get the treatment or not?
Binary variables can be included in regression models just like any other variable. So we can still be working with
Y = β0 + β1 X + β2 Z + ε
but now X or Z (or both) can only take two values: 0 (“are not”/false) or 1
(“are”/true).
If the binary variable is a control variable, we’re just shutting off back doors that go
through the variable.
β̂ 0 = 15: it is the expected mean of Sales when all the variables are zero: when
Winter is 0, we’re in Not Winter, and average sales in Not Winter is 15.
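A sketch of this logic with made-up numbers (only the Not Winter mean of 15 comes from the example): with a single dummy, OLS reproduces the two group means.

```python
# Sketch: with a single dummy, the intercept is the "dummy = 0" group mean
# and the dummy coefficient is the difference in group means.
import numpy as np

winter = np.array([0, 0, 0, 1, 1, 1])
sales = np.array([14., 15., 16., 21., 22., 23.])   # Not Winter mean = 15

beta0 = sales[winter == 0].mean()                          # 15.0
beta1 = sales[winter == 1].mean() - sales[winter == 0].mean()
print(beta0, beta1)   # OLS of sales on winter gives exactly these numbers
```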
Categorical variables
Categorical variables (e.g. Country you live in) can be recoded as a set of binary
variables
you live in Italy yes/no, etc
You need to exclude one category to avoid perfect multicollinearity with the
constant
e.g., you exclude US
Coefficients’ estimates are the difference between each category and the reference
category
Difference in average Y between Italy and US
If you want to know whether a categorical variable has a significant effect as a whole, you look at all the category coefficients jointly. This takes the form of a “joint F test”
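A sketch of the recoding with pandas (the country list is made up): a categorical variable becomes a set of dummies, with one reference category dropped.

```python
# Sketch: recode a categorical variable into binary columns, dropping one
# reference category (here US) to avoid perfect multicollinearity.
import pandas as pd

df = pd.DataFrame({"country": ["US", "Italy", "France", "Italy", "US"]})
dummies = pd.get_dummies(df["country"], prefix="country")
dummies = dummies.drop(columns="country_US")   # US becomes the reference
print(dummies)
# Each remaining column's coefficient would be the difference in average Y
# between that country and the US.
```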
Polynomials
Y = β₀ + β₁X + β₂X² + β₃X³

∂Y/∂X = β₁ + 2β₂X + 3β₃X²

The effect of X on Y depends on the specific value of X, i.e. it varies with X
Example Polynomials
For a restaurant with just one branch, adding a second one would be associated with a change in inspection score of −0.0802 + 2 × 0.0001(1) ≈ −0.08, i.e. a 0.08 reduction
But for a chain with 1000 restaurants, adding one more would increase the score by −0.0802 + 2 × 0.0001(1000) = 0.1198
In these cases, two things to keep in mind:
What is the support of X? I.e., are all values of X equally plausible?
Compute se(∂Y/∂X): if the effect is both positive and negative over the support of X, it cannot be significantly different from zero everywhere!
In this case, at the min (1) the marginal effect is negative, at the max (646) it is positive
Still, at the 75th percentile it is negative: for most values of X the effect is negative
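A sketch of the computation (the evaluation points other than the min of 1 and the max of 646 are illustrative):

```python
# Sketch: the marginal effect beta1 + 2*beta2*X evaluated over the support
# of X, using the quadratic coefficients from the example above.
import numpy as np

beta1, beta2 = -0.0802, 0.0001
X = np.array([1, 100, 401, 646])      # min, two interior points, max
marg = beta1 + 2 * beta2 * X          # effect of one more location at X
print(dict(zip(X.tolist(), marg.round(4))))
# Negative at 1, sign flips around X = 401, positive at 646.
```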
We saw that a variable is “right skewed” if there are a whole lot of observations with
low values and just a few observations with really high values. Income is an example
In this case, you might want to reduce the skew and run Y = β 0 + β 1 log(X ) + ε
Another reason you might want to use log(X ) rather than X in the regression is
because your DGP has logs in it
We model firms’ behavior. A common choice is to use a Cobb-Douglas production function:
Y = αK^β L^γ
A natural nonlinear econometric model meant to bring it to the data is
Y = αK^β L^γ + u
Writing the error in multiplicative form, u ≡ (αK^β L^γ)ε, gives
Y = (αK^β L^γ)(1 + ε) ≈ αK^β L^γ e^ε
Taking logs of both sides we obtain the loglinear regression model, which is linear in the parameters and in the logs of all the variables:
log Y = log α + β log K + γ log L + ε
Elasticities
In the loglinear model, β is the elasticity of Y with respect to K:
(∂Y/∂K) · (K/Y) = ∂ log Y / ∂ log K = β
In the linear model the elasticity is given by
(∂Y/∂K) · (K/Y) = β · (K/Y)
The linear model implies non-constant elasticities, while the loglinear model
imposes constant elasticities. This is ok if the underlying model is a Cobb-Douglas,
but it is not always desirable.
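A sketch with simulated Cobb-Douglas data: regressing log Y on log K and log L recovers the constant elasticities.

```python
# Sketch: loglinear regression recovers the Cobb-Douglas elasticities.
import numpy as np

rng = np.random.default_rng(0)
n = 500
K = rng.lognormal(size=n)
L = rng.lognormal(size=n)
Y = 0.8 * K**0.3 * L**0.6 * np.exp(rng.normal(scale=0.1, size=n))

W = np.column_stack([np.ones(n), np.log(K), np.log(L)])
coef = np.linalg.lstsq(W, np.log(Y), rcond=None)[0]
print(coef)   # close to log(0.8), 0.3, 0.6; the slopes are constant elasticities
```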
Semi–Elasticities
If the log of the dependent variable is regressed on a right hand side variable
expressed in levels, its coefficient measures the expected relative change in Y due
to an absolute change in X . This is a semi-elasticity.
For example, if X is a dummy for males, β = 0.10 tells us that the relative wage differential between men and women is (approximately) 10%; exactly, it is e^0.10 − 1 ≈ 10.5%.
Interaction Terms
What if the relationship between Y and X differs based on the value of a different
variable Z ?
Example: what’s the relationship between the price of gas and how much an individual chooses to drive?
For people who own a car, that relationship might be quite strong and negative.
For people who don’t own a car, that relationship is probably near zero.
For people who own a car but mostly get around by bike, that relationship might be
quite weak.
Solution: include the interaction X × Z and also Z in the regression:
Y = β 0 + β 1 X + β 2 Z + β 3 XZ + ε
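A sketch of the driving example (simulated data; statsmodels’ formula interface is our choice): X*Z in a formula expands to X + Z + X:Z, so β₃ lets the slope of X vary with Z.

```python
# Sketch: an interaction between gas price and car ownership.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"gas_price": rng.normal(size=300),
                   "owns_car": rng.integers(0, 2, size=300)})
df["driving"] = (-1.5 * df["gas_price"] * df["owns_car"]
                 + rng.normal(size=300))        # effect only for car owners

fit = smf.ols("driving ~ gas_price * owns_car", data=df).fit()
print(fit.params)   # gas_price near 0; gas_price:owns_car near -1.5
```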
Robust standard errors
Sampling distribution of β̂
Y = β0 + β1 X + ε
We obtain β̂ 1 and se ( β̂ 1 )
If what we assume on ε is correct, we then know the sampling distribution of β̂ 1 ,
and we can do inference (i.e. do hypothesis testing and say something on the
population, not only on the sample)
Assumptions on ε distribution are crucial!
We assumed that:
ε is Normally distributed
ε is Independently and Identically Distributed (IID)
Normality
If ε is not normally distributed, we can invoke the Law of Large Numbers (LLN)
and the Central Limit Theorem (CLT)
If ε is IID but not normally distributed, then asymptotically (i.e., if the sample is large enough) β̂₁ is normally distributed.
So, normality is not really crucial
IID
ε must be Independent:
unrelated to the error terms of other observations
unrelated to the other variables for the same observation
...and Identically Distributed:
the error term might be different from observation to observation but it will always
have been drawn from the same distribution.
Autocorrelation
Error terms are correlated with each other in some way.
This commonly pops up with data over multiple time periods (temporal
autocorrelation) or data that’s geographically clustered (spatial autocorrelation).
Temporal autocorrelation: you’re regressing the US unemployment rate growth on
year.
Heteroskedasticity
the variance of the error term’s distribution is related to the variables in the model.
Example: you regress how many Instagram followers someone has on the amount of
time spent posting daily.
If the IID assumption is not respected, OLS estimates of the parameters β̂ are still normally distributed in big samples (i.e., the LLN and CLT still apply)
But se(β̂₁) is no longer σ/√(n·var(X))
In multivariate regression, the variance-covariance matrix of the vector of OLS coefficient estimates is no longer (σ²/n)(W′W)⁻¹
Solution: instead of σ/√(n·var(X)) we use a “sandwich estimator”: the individual values of X that go into the var(X) calculation are scaled up or down, or even multiplied by other observations’ X values
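One implementation sketch (simulated heteroskedastic data): statsmodels exposes sandwich estimators through its cov_type argument; HC1 is a common heteroskedasticity-robust variant of White’s estimator.

```python
# Sketch: classical vs. heteroskedasticity-robust ("sandwich") standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=500)
Y = 1 + 2 * X + rng.normal(size=500) * (1 + np.abs(X))  # error sd grows with X

W = sm.add_constant(X)
ols = sm.OLS(Y, W).fit()                   # classical (IID) standard errors
robust = sm.OLS(Y, W).fit(cov_type="HC1")  # sandwich standard errors
print(ols.bse, robust.bse)                 # they differ once IID fails
```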
Research question: the effect of providing laptops to students on their test scores.
A given classroom of students isn’t just given laptops or not, they are also in the
same classroom with the same teacher.
The test scores of classmates will be similar based on having laptops (captured by
the regressor), and because of the similar environments they face (in the error term)
Their errors will be correlated with each other: these errors are clustered within
classroom.
Liang-Zeger standard errors:
explicitly specify a grouping, such as classrooms.
Then, let X values interact with other X values in the same group
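A sketch of the classroom example with cluster-robust (Liang-Zeger) standard errors, again via statsmodels’ cov_type argument (simulated data):

```python
# Sketch: treatment assigned at the classroom level, errors clustered
# within classroom, standard errors clustered accordingly.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
classroom = np.repeat(np.arange(50), 20)          # 50 classrooms of 20 pupils
laptop = rng.integers(0, 2, size=50)[classroom]   # treatment varies by class
class_effect = rng.normal(size=50)[classroom]     # shared classroom shock
score = 60 + 5 * laptop + class_effect + rng.normal(size=1000)

W = sm.add_constant(laptop.astype(float))
fit = sm.OLS(score, W).fit(cov_type="cluster",
                           cov_kwds={"groups": classroom})
print(fit.bse)   # larger than IID se's because errors cluster within class
```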
Penalized regression
It’s not uncommon to end up with Causal Diagrams that leave us with far too
many candidates for inclusion as controls.
What do we do if we have thirty, fifty, a thousand potential control variables that
might help us close a back door?
Including all of them would be a statistical mess
We probably want to drop some of those controls. But which ones?
We’d like some sort of model selection procedure that would do the choosing for us.
LASSO regression
LASSO picks the coefficients by still minimizing the SSR, but it also makes a function of the β’s small:

argmin_β { Σ (Y − Ŷ)² + λ Σ |β| }
Effect of penalization:
in LASSO, β̂ is big only if it really helps to reduce the SSR
it sends a lot of coefficients to zero, dropping those controls
Choice of λ
LASSO is good for selecting the regressors; to estimate the coefficients, it is better to run OLS on the subset of selected controls
If your Causal Diagram tells you a control closes a very important back door and LASSO drops it... well, put it back into the regression!
Since the size of the coefficients is sensitive to the scale of the variables, standardize them before running LASSO for selection; in the OLS regression you can then use the original ones, if you prefer:

(X − mean(X)) / sd(X)
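A sketch of the whole pipeline (simulated data; sklearn’s Lasso for selection, then OLS on the kept controls; sklearn’s alpha plays the role of λ):

```python
# Sketch: standardize, run LASSO to select controls, then re-estimate
# the selected specification with OLS on the original (unscaled) variables.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 300, 50
X = rng.normal(size=(n, p))
Y = 1 + 2 * X[:, 0] - 1 * X[:, 1] + rng.normal(size=n)  # only 2 real controls

X_std = StandardScaler().fit_transform(X)   # (X - mean(X)) / sd(X)
lasso = Lasso(alpha=0.1).fit(X_std, Y)      # alpha plays the role of lambda
keep = np.flatnonzero(lasso.coef_)          # controls LASSO did not zero out

post = sm.OLS(Y, sm.add_constant(X[:, keep])).fit()   # post-LASSO OLS
print(keep, post.params)
```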