Topic 3: Regression Analysis
Example: The dependence of crop yield on temperature, rainfall, sunshine, and fertilizer is statistical in nature, in the sense that the explanatory variables, although certainly
important, will not enable the agronomist to predict crop yield exactly because of errors involved
in measuring these variables as well as a host of other factors (variables) that collectively affect
the yield but may be difficult to identify individually. Thus, there is bound to be some “intrinsic”
or random variability in the dependent-variable crop yield that cannot be fully explained no
matter how many explanatory variables we consider.
In deterministic phenomena, on the other hand, we deal with relationships of the type, say,
exhibited by Newton’s law of gravity, which states: Every particle in the universe attracts every
other particle with a force directly proportional to the product of their masses and inversely
proportional to the square of the distance between them. Other examples include: Ohm’s law,
Boyle’s gas law, Kirchhoff’s law of electricity, and Newton’s law of motion.
Terminologies used

Y – dependent       X – independent
Response            Stimulus
Explained           Explanatory
Predictand          Predictor
Controlled          Control
Regressand          Regressor
Endogenous          Exogenous
If we are studying the dependence of a variable on only a single explanatory variable, such a
study is known as simple, or two-variable, regression analysis. However, if we are studying the
dependence of one variable on more than one explanatory variable, the study is known as
multiple regression analysis. In other words, in two-variable regression there is only one
explanatory variable, whereas in multiple regression there is more than one explanatory variable.
i.e. $E(Y/X_i) = \alpha + \beta X_i$.
The sample counterpart is $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$, where $\hat{\alpha}$ is the estimator of $\alpha$, $\hat{\beta}$ is the estimator of $\beta$, and $\hat{Y}$ is the estimator of $E(Y/X)$.
Population regression function: $Y_i = \alpha + \beta X_i + u_i$
Sample regression function: $Y_i = \hat{\alpha} + \hat{\beta} X_i + \hat{u}_i$
Methods of estimation
(i) Ordinary least squares method (OLS)
(ii) Method of moments
(iii) Maximum likelihood estimation (MLE)
OLS chooses $\hat{\alpha}$ and $\hat{\beta}$ to minimize the residual sum of squares
$$\sum \hat{u}_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$
Differentiating with respect to $\hat{\alpha}$ and $\hat{\beta}$ and setting each derivative to zero, which reduces to the two normal equations:
$$\sum Y_i = n\hat{\alpha} + \hat{\beta} \sum X_i \qquad \text{(c)}$$
$$\sum X_i Y_i = \hat{\alpha} \sum X_i + \hat{\beta} \sum X_i^2 \qquad \text{(d)}$$
Solving equations (c) and (d) simultaneously; in matrix form,
$$\begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum X_i Y_i \end{pmatrix}$$
so, by Cramer's rule,
$$\hat{\beta} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}$$
$$\hat{\alpha} = \frac{\sum X_i^2 \sum Y_i - \sum X_i \sum X_i Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \bar{Y} - \hat{\beta} \bar{X}$$
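As a quick numerical illustration (not part of the original notes), the closed-form formulas above can be evaluated directly; the small X and Y arrays below are hypothetical:

```python
# A minimal sketch: computing the OLS estimates alpha_hat and beta_hat
# with the closed-form formulas derived above. Data are made up.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory variable
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical dependent variable
n = len(X)

# beta_hat = (n*sum(XY) - sum(X)*sum(Y)) / (n*sum(X^2) - (sum(X))^2)
beta_hat = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (
    n * np.sum(X ** 2) - np.sum(X) ** 2
)
# alpha_hat = Y_bar - beta_hat * X_bar
alpha_hat = Y.mean() - beta_hat * X.mean()

print(f"beta_hat  = {beta_hat:.4f}")
print(f"alpha_hat = {alpha_hat:.4f}")
# Cross-check against numpy's least-squares fit (returns [slope, intercept]):
print(np.polyfit(X, Y, 1))
```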
2) Method of moments (an analogy principle)
The analogy principle is based on the following two assumptions:
i) The expectation of the error term is zero, i.e. $E(u_i) = 0$.
ii) The error term is uncorrelated with the explanatory variable, i.e. $E(X_i u_i) = 0$.
If $\hat{u}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i$, the sample analogue of assumption (i) is $\sum \hat{u}_i = 0$:
$$\sum Y_i - n\hat{\alpha} - \hat{\beta} \sum X_i = 0$$
This implies
$$\sum Y_i = n\hat{\alpha} + \hat{\beta} \sum X_i \qquad \text{(a)}$$
Multiplying the residual equation by $X_i$ and taking summations on both sides, the sample analogue of assumption (ii) is $\sum X_i \hat{u}_i = 0$:
$$\sum X_i Y_i - \hat{\alpha} \sum X_i - \hat{\beta} \sum X_i^2 = 0$$
which gives
$$\sum X_i Y_i = \hat{\alpha} \sum X_i + \hat{\beta} \sum X_i^2 \qquad \text{(b)}$$
Equations (a) and (b) are exactly the normal equations obtained under OLS, so solving them simultaneously yields the same estimators:
$$\hat{\beta} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}, \qquad \hat{\alpha} = \frac{\sum X_i^2 \sum Y_i - \sum X_i \sum X_i Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \bar{Y} - \hat{\beta} \bar{X}$$
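A small check, reusing the hypothetical data from the earlier sketch: at the fitted values, both sample moment conditions hold (up to floating-point error), which is why the method of moments reproduces the OLS estimates in this model:

```python
# Illustrative check: the residuals at the OLS solution satisfy
# sum(u_hat) = 0 and sum(X*u_hat) = 0, the sample analogues of the
# two moment conditions E(u) = 0 and E(Xu) = 0.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(X)

beta_hat = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (
    n * np.sum(X ** 2) - np.sum(X) ** 2
)
alpha_hat = Y.mean() - beta_hat * X.mean()

u_hat = Y - alpha_hat - beta_hat * X       # residuals
print(np.isclose(u_hat.sum(), 0.0))        # sample analogue of E(u) = 0
print(np.isclose((X * u_hat).sum(), 0.0))  # sample analogue of E(Xu) = 0
```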
Example: Using the data in Example (M),
i) Specify the econometric model relating X and Y
ii) Estimate the coefficients of your model
Solution:
i) The econometric model relating X and Y is $Y_i = \alpha + \beta X_i + u_i$, with estimated (sample) counterpart $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$.
ii) Recall
$$\hat{\beta} = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}, \qquad \hat{\alpha} = \frac{\sum X_i^2 \sum Y_i - \sum X_i \sum X_i Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}$$
and substitute the sums computed from the data in Example (M).
Example: Let $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\mu}_3$ denote weighted sample means from a sample of size 3 picked from a normal population with mean $\mu$ and variance $\sigma^2$. Examine them for unbiasedness and efficiency.
Solution
For unbiasedness
Recall: a good estimator is unbiased if $E(\hat{\mu}) = \mu$.
$E(\hat{\mu}_1) = \mu$; therefore, $\hat{\mu}_1$ is unbiased.
$E(\hat{\mu}_2) = \mu$; therefore, $\hat{\mu}_2$ is unbiased.
$E(\hat{\mu}_3) \neq \mu$; therefore, $\hat{\mu}_3$ is biased.
For efficiency, compute the variance of each estimator: writing an estimator as a weighted sum of the independent observations, its variance is $\sigma^2$ times the sum of the squared weights. Comparing the three variances, the smallest is 0.16; therefore, the estimator with variance 0.16 is the most efficient estimator since it has the least variance.
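A minimal Monte Carlo sketch of this kind of comparison. The exact weights used in the original example are not recoverable here, so the three weighted means below are hypothetical stand-ins; the point is the method: check E(mu_hat) for bias, then compare variances:

```python
# Hypothetical stand-ins for the three estimators: mu1 is the ordinary
# sample mean, mu2 uses unequal weights that still sum to 1 (unbiased),
# and mu3's weights sum to 3/4, so it is biased.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 5.0, 1.0, 100_000

samples = rng.normal(mu, sigma, size=(reps, 3))  # many samples of size 3
x1, x2, x3 = samples[:, 0], samples[:, 1], samples[:, 2]

mu1 = (x1 + x2 + x3) / 3      # weights sum to 1 -> unbiased, Var = sigma^2/3
mu2 = (x1 + 2 * x2 + x3) / 4  # weights sum to 1 -> unbiased, Var = 6*sigma^2/16
mu3 = (x1 + x2 + x3) / 4      # weights sum to 3/4 -> biased

for name, est in [("mu1", mu1), ("mu2", mu2), ("mu3", mu3)]:
    print(f"{name}: mean = {est.mean():.3f}, variance = {est.var():.3f}")
# Among the unbiased estimators, mu1 has the smaller variance
# (sigma^2/3 ~ 0.333 versus 0.375 for mu2), so mu1 is more efficient.
```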
Gauss–Markov Theorem:
It states that, given the assumptions of the classical linear regression model, the least-squares estimators have minimum variance among linear unbiased estimators, i.e. they are BLUE (best linear unbiased estimators).
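An illustrative simulation of the theorem's content (not from the notes): the OLS slope and the crude "endpoint" slope $(Y_n - Y_1)/(X_n - X_1)$ are both linear unbiased estimators of $\beta$, but OLS shows the smaller variance:

```python
# Both estimators below are linear in Y and unbiased for beta; the
# simulation shows OLS attains the smaller variance, as Gauss-Markov says.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma, reps = 1.0, 2.0, 1.0, 50_000
X = np.arange(1.0, 11.0)  # fixed regressors, n = 10

ols, endpoint = [], []
for _ in range(reps):
    Y = alpha + beta * X + rng.normal(0, sigma, X.size)
    b_ols = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    ols.append(b_ols)
    endpoint.append((Y[-1] - Y[0]) / (X[-1] - X[0]))

print(f"OLS:      mean = {np.mean(ols):.3f}, variance = {np.var(ols):.5f}")
print(f"endpoint: mean = {np.mean(endpoint):.3f}, variance = {np.var(endpoint):.5f}")
# Both means are ~2.0 (unbiased); the OLS variance is clearly smaller.
```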
The coefficient of determination (the measure of goodness of fit)
The coefficient of determination ($r^2$ for a two-variable model, or $R^2$ for a multiple regression model) tells us how well the sample regression line fits the data.
In a two-variable model, the coefficient of determination $r^2$ tells us how much of the variation in the dependent variable is accounted for by variation in the independent variable.
The coefficient of determination is obtained by squaring the Pearson product-moment correlation coefficient, i.e. $r^2 = (r)^2$. For example, if $r = 0.98$, then $r^2 = 0.9604 \approx 0.96$.
Interpretation: Approximately 96% of the changes in quantity supplied (Y) are due to changes in price (X); that is, approximately 96% of the changes in Y are explained by changes in X.
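A short sketch with made-up price/quantity data showing the relationship just stated, computing Pearson's $r$ and squaring it:

```python
# r^2 is the square of the Pearson product-moment correlation coefficient.
import numpy as np

price = np.array([2.0, 3.0, 4.0, 5.0, 6.0])       # hypothetical X
quantity = np.array([4.1, 6.2, 7.8, 10.3, 11.9])  # hypothetical Y

r = np.corrcoef(price, quantity)[0, 1]  # Pearson r
print(f"r   = {r:.4f}")
print(f"r^2 = {r ** 2:.4f}")  # e.g. r = 0.98 would give r^2 ~ 0.96
```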
Properties of r²
It is non-negative.
It falls between 0 and 1 inclusive, i.e. $0 \le r^2 \le 1$.
Limitations of r² (R-squared)
r² cannot determine whether the coefficient estimates and predictions are biased.
r² does not indicate whether a regression model is adequate. You can have a low r² value for a good model, or a high r² value for a model that does not fit the data.
Testing goodness of fit with R-squared
R-squared is used to judge the explanatory power of the linear regression of the dependent variable Y on the independent variable, say X. It measures the dispersion of the observations around the regression line.
The closer the observations are to the line, the better the variation in Y is explained by changes in the explanatory variables.
Put differently, R-squared shows the percentage of the total variation of the dependent variable that can be explained by the independent variable. For decision purposes, a good R-squared should be greater than or equal to 0.5, that is, explaining at least 50% of the total fluctuations in the dependent variable.
Note: Multicollinearity is a situation whereby the independent variables are correlated with one another.
The error term should be homoscedastic (the variance of the errors is the same for all values of the independent variables).
Example:
The salary of employees in a certain factory depends on one's age (X1), number of children ever born (X2), years of experience (X3), and distance from home to the factory (X4), as given in the model below:
$$\hat{Y} = \hat{\beta}_0 + 3.2X_1 - 2.4X_2 + 3.0X_3 - 2.2X_4$$
X1: as age increases by one year, the employee's salary will increase by approximately 3.2 shillings, ceteris paribus.
X2: an increase in the number of children ever born by one child decreases one's salary by approximately 2.4 shillings, ceteris paribus.
X3: when the years of experience increase by one year, the average salary of an employee increases by approximately 3 shillings, ceteris paribus.
X4: an increase in the distance from home to the factory by one kilometre reduces the employee's salary by approximately 2.2 shillings, ceteris paribus.
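A hedged sketch of how such a multiple regression could be fitted. The data below are simulated from coefficients like the ones interpreted above (the intercept of 10.0 and the noise level are invented), so the recovered slopes should land near 3.2, -2.4, 3.0, and -2.2:

```python
# Each fitted slope is read as the change in salary per unit change in
# that regressor, holding the other regressors constant.
import numpy as np

rng = np.random.default_rng(2)
n = 200
age = rng.uniform(20, 60, n)
children = rng.integers(0, 6, n)
experience = rng.uniform(0, 30, n)
distance = rng.uniform(0, 40, n)

# Simulate salaries from coefficients like those interpreted above.
salary = (10.0 + 3.2 * age - 2.4 * children + 3.0 * experience
          - 2.2 * distance + rng.normal(0, 5, n))

Xmat = np.column_stack([np.ones(n), age, children, experience, distance])
coefs, *_ = np.linalg.lstsq(Xmat, salary, rcond=None)
print(coefs)  # ~[10.0, 3.2, -2.4, 3.0, -2.2]
```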
Consider the Cobb–Douglas production function
$$Y_i = \beta_1 L_i^{\beta_2} K_i^{\beta_3} e^{u_i}$$
where Y is output, L is the labor input, and K is the capital input. It is clear that the relationship between output and the two inputs is nonlinear. However, if we log-transform this model, we obtain
$$\ln Y_i = \ln \beta_1 + \beta_2 \ln L_i + \beta_3 \ln K_i + u_i \;\ldots\; (1)$$
Let $\beta_0 = \ln \beta_1$:
$$\ln Y_i = \beta_0 + \beta_2 \ln L_i + \beta_3 \ln K_i + u_i \;\ldots\; (2)$$
Thus written, the model in equation (2) is nonlinear in the variables Y, L, and K but linear in the logs of these variables. Therefore, equation (2) is a log-log, double-log, or log-linear model.
The properties of the Cobb–Douglas production function are quite well known.
1. $\beta_2$ is the (partial) elasticity of output with respect to the labor input; that is, it measures the percentage change in output for, say, a 1 percent change in the labor input, holding the capital input constant.
2. Likewise, $\beta_3$ is the (partial) elasticity of output with respect to the capital input, holding the labor input constant.
3. The sum $(\beta_2 + \beta_3)$ gives information about the returns to scale, that is, the response of output to a proportionate change in the inputs (see the sketch after this list):
i) If $\beta_2 + \beta_3 = 1$, then there are constant returns to scale; that is, doubling the inputs will double the output, tripling the inputs will triple the output, and so on.
ii) If $\beta_2 + \beta_3 < 1$, there are decreasing returns to scale, i.e. doubling the inputs will less than double the output.
iii) If $\beta_2 + \beta_3 > 1$, there are increasing returns to scale, i.e. doubling the inputs will more than double the output.
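A numeric check of the three cases (illustrative values only): for $Y = A L^{\beta_2} K^{\beta_3}$, scaling both inputs by $t$ scales output by $t^{\beta_2 + \beta_3}$:

```python
# Doubling both inputs multiplies output by 2**(b2 + b3), so the
# three cases below show constant, decreasing, and increasing returns.
def output(A, b2, b3, L, K):
    return A * L ** b2 * K ** b3

A, L, K = 1.0, 4.0, 9.0
for b2, b3 in [(0.6, 0.4), (0.3, 0.4), (0.9, 0.4)]:  # sums: 1.0, 0.7, 1.3
    ratio = output(A, b2, b3, 2 * L, 2 * K) / output(A, b2, b3, L, K)
    print(f"b2 + b3 = {b2 + b3:.1f}: doubling inputs multiplies output by {ratio:.3f}")
# prints 2.000 (constant), ~1.625 (decreasing), ~2.462 (increasing)
```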
Note: whenever you have a log-linear regression model involving any number of variables, the coefficient of each of the X variables measures the (partial) elasticity of the dependent variable Y with respect to that variable. Thus, if you have a k-variable log-linear model:
$$\ln Y_i = \beta_0 + \beta_2 \ln X_{2i} + \beta_3 \ln X_{3i} + \dots + \beta_k \ln X_{ki} + u_i \;\ldots\; (3)$$
each of the (partial) regression coefficients, $\beta_2$ through $\beta_k$, is the (partial) elasticity of Y with respect to variables $X_2$ through $X_k$.
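A sketch of this in practice, on synthetic data (the "true" elasticities 1.5 and 0.5 are invented): regressing ln Y on the logs of the inputs recovers the elasticities as the fitted slopes:

```python
# Log-linear regression: the slopes on ln(L) and ln(K) are the
# partial elasticities of output with respect to labor and capital.
import numpy as np

rng = np.random.default_rng(3)
n = 500
L = rng.uniform(50, 500, n)   # hypothetical labor input
K = rng.uniform(10, 100, n)   # hypothetical capital input
b0, b2, b3 = 0.5, 1.5, 0.5    # invented "true" parameters
lnY = b0 + b2 * np.log(L) + b3 * np.log(K) + rng.normal(0, 0.1, n)

Xmat = np.column_stack([np.ones(n), np.log(L), np.log(K)])
coefs, *_ = np.linalg.lstsq(Xmat, lnY, rcond=None)
print(coefs)  # ~[0.5, 1.5, 0.5]: the slopes are the partial elasticities
```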
Example: extracted as it is from Basic Econometrics by Damodar N. Gujarati.
To illustrate the Cobb–Douglas production function, we obtained the data shown in Table 7.3;
these data are for the agricultural sector of Taiwan for 1958–1972.
The output elasticities of labor and capital were 1.4988 and 0.4899, respectively. We can interpret
this as follows:
Interpretation:
1.4988
Over the period of study, holding the capital input constant, a 1% increase in the labor input led on average to about a 1.5% increase in output. This estimate is statistically significant (its |t| value exceeds the critical value).
0.4899
Holding the labor input constant, a 1% increase in the capital input led on average to about a 0.5% increase in output. This estimate is also statistically significant.
Since $\beta_2 + \beta_3 = 1.4988 + 0.4899 = 1.9887 > 1$, there were increasing returns to scale. Therefore, over the period of the study, the Taiwanese agricultural sector was characterized by increasing returns to scale.
R²
The estimated regression line fits the data quite well. The R2 value of 0.8890 means that about
89 % of the variation in the (log of) output is explained by the (logs of) labor and capital.