
Regression analysis

Regression analysis is defined as the study of the dependence of one variable (the dependent variable) on one or more other variables (the independent, or explanatory, variables), with a view to estimating or predicting the population mean value of the dependent variable.
Examples
i) An economist may be interested in studying the dependence of personal consumption
expenditure on after-tax or disposable real personal income. Such an analysis may be
helpful in estimating the marginal propensity to consume (MPC), that is, average change
in consumption expenditure for, say, a dollar’s worth of change in real income.
ii) A monopolist who can fix the price or output (but not both) may want to find out the
response of the demand for a product to changes in price. Such an experiment may enable
the estimation of the price elasticity (i.e., price responsiveness) of the demand for the
product and may help determine the most profitable price.
iii) An agronomist may be interested in studying the dependence of crop yield, say, of wheat,
on temperature, rainfall, amount of sunshine, and fertilizer. Such a dependence analysis
may enable the prediction or forecasting of the average crop yield, given information
about the explanatory variables.

STATISTICAL VERSUS DETERMINISTIC RELATIONSHIPS


In regression analysis we are concerned with what is known as the statistical, not functional or
deterministic, dependence among variables, such as those of classical physics.
• In statistical relationships among variables we essentially deal with random or stochastic
variables, that is, variables that have probability distributions.
• In functional or deterministic dependency, on the other hand, we deal with variables that
are not random or stochastic.

Example: The dependence of crop yield on temperature, rainfall, sunshine, and fertilizer, for
example, is statistical in nature in the sense that the explanatory variables, although certainly
important, will not enable the agronomist to predict crop yield exactly because of errors involved
in measuring these variables as well as a host of other factors (variables) that collectively affect
the yield but may be difficult to identify individually. Thus, there is bound to be some “intrinsic”
or random variability in the dependent-variable crop yield that cannot be fully explained no
matter how many explanatory variables we consider.

In deterministic phenomena, on the other hand, we deal with relationships of the type, say,
exhibited by Newton’s law of gravity, which states: Every particle in the universe attracts every
other particle with a force directly proportional to the product of their masses and inversely
proportional to the square of the distance between them. Other examples include: Ohm’s law,
Boyle’s gas law, Kirchhoff’s law of electricity, and Newton’s law of motion.

REGRESSION VERSUS CAUSATION


Although regression analysis deals with the dependence of one variable on other variables, it
does not necessarily imply causation.
REGRESSION VERSUS CORRELATION
Regression and correlation are closely related but conceptually very different.
In correlation analysis, the primary objective is to measure the strength or degree of linear
association between two variables. The correlation coefficient measures this strength of
(linear) association. For example, we may be interested in finding the correlation (coefficient)
between smoking and lung cancer, between scores on statistics and mathematics examinations,
between high school grades and college grades, and so on.
In regression analysis, as already noted, we are not primarily interested in such a measure.
Instead, we try to estimate or predict the average value of one variable on the basis of the fixed
values of other variables. Thus, we may want to know whether we can predict the average score
on a statistics examination by knowing a student’s score on a mathematics examination.

Differences between regression and correlation


In regression analysis there is an asymmetry in the way the dependent and explanatory variables
are treated. The dependent variable is assumed to be statistical/random/stochastic, that is, to have
a probability distribution. The explanatory variables, on the other hand, are assumed to have
fixed values.
In correlation analysis, on the other hand, we treat any (two) variables symmetrically; there is no
distinction between the dependent and explanatory variables. After all, the correlation between
scores on mathematics and statistics examinations is the same as that between scores on statistics
and mathematics examinations. Moreover, both variables are assumed to be random.

Terminologies used
Y (dependent)        X (independent)
Response             Stimulus
Explained            Explanatory
Predictand           Predictor
Controlled           Control
Regressand           Regressor
Endogenous           Exogenous
If we are studying the dependence of a variable on only a single explanatory variable, such a
study is known as simple, or two variable, regression analysis. However, if we are studying the
dependence of one variable on more than one explanatory variable, the study is known as
multiple regression analysis. In other words, in two-variable regression there is only one
explanatory variable, whereas in multiple regression there is more than one explanatory variable.

Population regression function (PRF): the line that joins the average (conditional mean) values of the dependent variable corresponding to each fixed value of the explanatory variable.


e.g. the data below shows the quantity demanded (Y) of airtime vouchers at different prices (X) by
students A, B, C, D, and E. Construct a PRF/curve.

             Price (X)
Student      500    1000   2000   5000
A             10       8      5      2
B              8       4      3      1
C              9       7      5      5
D              6       5      2      0
E              7       4      2      1
Average        8     5.6    3.4    1.8
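Joining these column averages E(Y ⁄ X) over the four price levels gives the population regression curve. A minimal sketch of the averaging step in Python, using the figures from the table above:

# Conditional means E(Y | X) for the airtime-voucher data above
demand = {
    500:  [10, 8, 9, 6, 7],
    1000: [8, 4, 7, 5, 4],
    2000: [5, 3, 5, 2, 2],
    5000: [2, 1, 5, 0, 1],
}

for price, quantities in demand.items():
    avg = sum(quantities) / len(quantities)    # conditional mean of Y given X = price
    print(price, avg)                          # 8.0, 5.6, 3.4, 1.8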

Conditional population regression function/curve/line:

E(Y ⁄ Xᵢ) = α + βXᵢ

Sample regression function (SRF)

Is a representation of the population regression function and is given by:

ŷᵢ = α̂ + β̂xᵢ

Where α̂ is the estimator of α, β̂ is the estimator of β, and ŷᵢ is the estimator of E(Y ⁄ Xᵢ).

Stochastic vs deterministic relationship

Stochastic (random) relationships follow probability distributions, while deterministic relationships are
always exact and do not follow a probability distribution. The population regression
econometric model is given by Yᵢ = α + βXᵢ + uᵢ
Where Y = dependent variable
α = constant (intercept)
X = independent variable
β = coefficient / parameter of the independent variable
u = the error term / random / stochastic / white noise term.

Reasons for including the error term

1) Omission of a relevant variable in a model. Reasons for omission are:
• Some variables are qualitative and cannot be quantified, e.g. tastes and preferences.
• Some factors cannot be predicted with certainty, e.g. earthquakes, war, etc.
• Some factors may not be known.
2) Incorrect specification of the mathematical model, e.g. a model may be specified as linear when
the true relationship is nonlinear.
3) Randomness of human behavior.
Points that are spread on a scatter diagram are due to the erratic behavior inherent in human
beings.
4) Aggregation of variables. Aggregation involves summing up the behavior of individuals that are
dissimilar; when you aggregate, you cannot capture individual realities.
5) Need for parsimony (simplicity). This is the desire to keep the model as simple as
possible.
6) Errors of measurement.
Points off the line of best fit are due to errors of measurement committed during
data collection, compilation and analysis.

Linearity in regression analysis

A variable (or parameter) is linear when it is raised to the power 1.
(a) Linearity in parameters
• Y = α + βX is linear in parameters since α and β are raised to the power 1.
• Y = α + β²X is not linear in parameters because β is raised to the power 2.
(b) Linearity in variables
• Y = α + βX is linear in variables because Y and X are raised to the power 1.
• Y = α + βX⁻¹ is not linear in variables because X is raised to the power −1.
Remarks
i) In regression analysis the model must be linear in parameters; the variables can
either be linear or nonlinear.
ii) To estimate α and β in the classical linear regression model, a number of
assumptions must be made.

Assumptions of the classical linear regression model based on the error term (uᵢ)

• The error term is random/stochastic. It can take on a positive, negative or zero value.
• The expected value of the error term is zero, i.e. E(uᵢ) = 0.
• The error terms are uncorrelated, i.e. Cov(uᵢ, uⱼ) = 0 for all i ≠ j.
• The variance of the error term is constant, i.e. Var(uᵢ) = σ². Constant variance
is referred to as homoscedasticity.
Note: Unequal variance, i.e. Var(uᵢ) = σᵢ², is called heteroscedasticity.
• The error term and the explanatory variables are uncorrelated, i.e. Cov(uᵢ, Xᵢ) = 0.
Proof.
Cov(uᵢ, Xᵢ) = E{[uᵢ − E(uᵢ)][Xᵢ − E(Xᵢ)]}
= E[uᵢ(Xᵢ − E(Xᵢ))], since E(uᵢ) = 0
= E(uᵢXᵢ) − E(uᵢ)E(Xᵢ)
= XᵢE(uᵢ) − E(uᵢ)E(Xᵢ), since Xᵢ is nonstochastic (fixed)
= 0 − 0 = 0
• The error term is normally distributed with mean zero and constant variance, i.e. uᵢ ~ N(0, σ²).
• The explanatory/independent variables are accurately measured; in other words, any measurement
error is absorbed by the error term.
Other assumptions
• The number of observations must be greater than the number of explanatory variables.
• The parameters must be constant (the model must be linear in parameters).
• The explanatory variables must be uncorrelated, i.e. there is no relationship between or among the
explanatory variables (there is no multicollinearity).
Review of Cramer’s method of solving simultaneous equations
Solve for x and y in
4x + 2y = 5 ..............(1)
3x − 4y = 1 ..............(2)
Solution
• Express in matrix form:

( 4    2 ) ( x )   ( 5 )
( 3   −4 ) ( y ) = ( 1 )

Let A = ( 4   2 ; 3   −4 ), det A = (4)(−4) − (2)(3) = −22

x = det(A₁) / det A, where A₁ replaces the first column of A with the constants:
x = [(5)(−4) − (2)(1)] / (−22) = −22 / −22 = 1

y = det(A₂) / det A, where A₂ replaces the second column of A with the constants:
y = [(4)(1) − (5)(3)] / (−22) = −11 / −22 = 1/2
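The same system can be checked numerically; a minimal sketch using numpy to evaluate the determinants required by Cramer's method (the matrix and constants are those of equations (1) and (2) above):

import numpy as np

# Coefficient matrix and constants for 4x + 2y = 5 and 3x - 4y = 1
A = np.array([[4.0,  2.0],
              [3.0, -4.0]])
b = np.array([5.0, 1.0])

det_A = np.linalg.det(A)            # -22

# Cramer's rule: replace each column of A with b in turn
A1 = A.copy(); A1[:, 0] = b         # first column replaced by the constants
A2 = A.copy(); A2[:, 1] = b         # second column replaced by the constants

x = np.linalg.det(A1) / det_A       # 1.0
y = np.linalg.det(A2) / det_A       # 0.5
print(x, y)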

Estimation of a two variable model

yᵢ = α̂ + β̂xᵢ + ûᵢ .................(1)

Methods of estimation
(i) Ordinary least square method (OLS)
(ii) Method of moments
(iii) Maximum likelihood estimation (MLE)

1) Ordinary least square method (OLS)

From (1), ûᵢ = yᵢ − α̂ − β̂xᵢ .................(a)

Squaring and summing both sides of (a) gives the residual sum of squares:

∑ûᵢ² = ∑(yᵢ − α̂ − β̂xᵢ)² .................(b)

Taking the partial derivative of Equation (b) with respect to α̂ gives:

∂∑ûᵢ²/∂α̂ = −2∑(yᵢ − α̂ − β̂xᵢ)

Equating ∂∑ûᵢ²/∂α̂ to zero gives:

∑yᵢ = nα̂ + β̂∑xᵢ .................(c)

Taking the partial derivative of Equation (b) with respect to β̂ gives:

∂∑ûᵢ²/∂β̂ = −2∑xᵢ(yᵢ − α̂ − β̂xᵢ)

Equating ∂∑ûᵢ²/∂β̂ to zero gives:

∑xᵢ(yᵢ − α̂ − β̂xᵢ) = 0

which reduces to

∑xᵢyᵢ = α̂∑xᵢ + β̂∑xᵢ² .................(d)

Solving Equations (c) and (d) simultaneously, in matrix form:

( n      ∑xᵢ  ) ( α̂ )   ( ∑yᵢ   )
( ∑xᵢ    ∑xᵢ² ) ( β̂ ) = ( ∑xᵢyᵢ )

det = n∑xᵢ² − (∑xᵢ)²

α̂ = (∑xᵢ²∑yᵢ − ∑xᵢ∑xᵢyᵢ) / (n∑xᵢ² − (∑xᵢ)²)

β̂ = (n∑xᵢyᵢ − ∑xᵢ∑yᵢ) / (n∑xᵢ² − (∑xᵢ)²)
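A minimal sketch of these normal-equation formulas in Python; the x and y values below are made up purely for illustration:

# OLS estimates for a two-variable model using the formulas derived above
x = [1.0, 2.0, 3.0, 4.0, 5.0]      # illustrative explanatory variable
y = [2.1, 4.3, 5.9, 8.2, 9.8]      # illustrative dependent variable
n = len(x)

sum_x  = sum(x)
sum_y  = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

det = n * sum_x2 - sum_x ** 2                           # n*Sum(x^2) - (Sum x)^2
beta_hat  = (n * sum_xy - sum_x * sum_y) / det          # slope estimate
alpha_hat = (sum_x2 * sum_y - sum_x * sum_xy) / det     # intercept estimate
print(alpha_hat, beta_hat)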
2) Method of moments (the analogy principle)
The analogy principle is based on the following two assumptions:
i) The expectation of the error term is zero, i.e. E(uᵢ) = 0, so that ∑ûᵢ = 0.
ii) The error term is uncorrelated with the explanatory variable, i.e. E(uᵢxᵢ) = 0, so that ∑xᵢûᵢ = 0.

If ûᵢ = yᵢ − α̂ − β̂xᵢ,

this implies that yᵢ = α̂ + β̂xᵢ + ûᵢ .................(a)

Summing both sides of Equation (a) gives:
∑yᵢ = ∑(α̂ + β̂xᵢ + ûᵢ)

∑yᵢ = nα̂ + β̂∑xᵢ + ∑ûᵢ

Since ∑ûᵢ = 0, this implies ∑yᵢ = nα̂ + β̂∑xᵢ .................(b)

Multiplying Equation (a) by xᵢ gives:
xᵢyᵢ = α̂xᵢ + β̂xᵢ² + xᵢûᵢ
Taking summations on both sides:

∑xᵢyᵢ = α̂∑xᵢ + β̂∑xᵢ² + ∑xᵢûᵢ

Since ∑xᵢûᵢ = 0, this implies ∑xᵢyᵢ = α̂∑xᵢ + β̂∑xᵢ² .................(c)

Solving (b) and (c) simultaneously, exactly as in the OLS case:

( n      ∑xᵢ  ) ( α̂ )   ( ∑yᵢ   )
( ∑xᵢ    ∑xᵢ² ) ( β̂ ) = ( ∑xᵢyᵢ )

det = n∑xᵢ² − (∑xᵢ)²

α̂ = (∑xᵢ²∑yᵢ − ∑xᵢ∑xᵢyᵢ) / (n∑xᵢ² − (∑xᵢ)²)

β̂ = (n∑xᵢyᵢ − ∑xᵢ∑yᵢ) / (n∑xᵢ² − (∑xᵢ)²)

Thus the method of moments yields the same estimators as OLS.
Example: Using the data in Example (M),
i) Specify the econometric model relating X and Y
ii) Estimate the coefficients of your model
Solution:
i) Yᵢ = α̂ + β̂Xᵢ + ûᵢ
ii)
Recall, from Example (M) (with n = 10):

∑Y = 590, ∑X = 110, ∑XY = 8380, ∑X² = 1540, ∑Y² = 46,100

β̂ = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²) = [10(8380) − (110)(590)] / [10(1540) − (110)²] = 18900/3300 ≈ 5.7273

α̂ = (∑X²∑Y − ∑X∑XY) / (n∑X² − (∑X)²) = [(1540)(590) − (110)(8380)] / 3300 = −13200/3300 = −4

iii) Ŷᵢ = −4 + 5.7273Xᵢ
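These estimates (and the correlation coefficient recalled later) can be reproduced from the summary sums; the sample size n = 10 is the value inferred above:

import math

# Reproduce the Example (M) estimates from the summary sums
n = 10                                      # sample size (inferred; see text)
sum_y, sum_x, sum_xy = 590, 110, 8380
sum_x2, sum_y2 = 1540, 46100

det = n * sum_x2 - sum_x ** 2                              # 3300
beta_hat  = (n * sum_xy - sum_x * sum_y) / det             # ~5.7273
alpha_hat = (sum_x2 * sum_y - sum_x * sum_xy) / det        # -4.0
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(det * (n * sum_y2 - sum_y ** 2))
print(alpha_hat, beta_hat, round(r, 4), round(r ** 2, 4))  # r ~ 0.9792, r^2 ~ 0.9588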

Properties of a good estimator

(a) A good estimator should be linear.
(b) It should be unbiased, i.e. E(β̂) = β.
Note: Bias(β̂) = E(β̂) − β.
Bias is the difference between the expectation of the estimator and the population parameter.
(c) It should have minimum variance (it should be efficient), i.e. no other estimator
should have a smaller variance than that of α̂ and β̂.
(d) A good estimator should have the least mean square error (MSE), i.e.
MSE(β̂) = E[(β̂ − β)²] = Var(β̂) + [Bias(β̂)]².
(e) A good estimator should be consistent, i.e. as the sample size increases indefinitely, the
estimator converges to the true population parameter.
Example
Compare the following estimators and determine the best estimator:

μ̂₁, μ̂₂ and μ̂₃,

where μ̂₁, μ̂₂ and μ̂₃ denote estimators of the mean based on a sample of size 3 picked from a normal population
with mean μ and variance σ².

Solution

• For unbiasedness
Recall: A good estimator is unbiased if E(μ̂) = μ.

E(μ̂₁) = μ. Therefore, μ̂₁ is unbiased.

E(μ̂₂) = μ. Therefore, μ̂₂ is unbiased.

E(μ̂₃) ≠ μ. Therefore, μ̂₃ is biased.

• Testing for efficiency

Recall: A good estimator is efficient if it has the smallest variance.

The variance of each estimator is obtained from Var(aX₁ + bX₂ + cX₃) = (a² + b² + c²)σ², since the
observations are independent with common variance σ².

The smallest of the three variances is 0.16σ².

Therefore, the estimator with variance 0.16σ² is the most efficient estimator since it has the least variance.
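As an illustration of this kind of comparison, the sketch below uses three hypothetical linear estimators of μ (they are stand-ins, not the exact definitions used in the original exercise) and shows how the bias and variance of any linear estimator follow from its weights:

# Bias and variance of linear estimators of mu built from X1, X2, X3,
# where each Xi ~ N(mu, sigma^2) and the Xi are independent.
# For an estimator sum(w_i * X_i): E = (sum w_i) * mu, Var = (sum w_i^2) * sigma^2.
estimators = {
    "mu_hat_1 = (X1 + X2 + X3)/3":   [1/3, 1/3, 1/3],
    "mu_hat_2 = (X1 + 2*X2 + X3)/4": [1/4, 2/4, 1/4],
    "mu_hat_3 = (X1 + X2 + X3)/4":   [1/4, 1/4, 1/4],
}

for name, weights in estimators.items():
    unbiased = abs(sum(weights) - 1.0) < 1e-12      # unbiased iff the weights sum to 1
    variance = sum(w ** 2 for w in weights)         # in units of sigma^2
    print(f"{name}: unbiased={unbiased}, variance={variance:.4f}*sigma^2")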

Gauss-Markov theorem:
It states that “given the assumptions of the classical linear regression model, the least squares
estimators have minimum variance’’, i.e. they are BLUE (best linear unbiased estimators).
The coefficient of determination (the measure of goodness of fit)
The coefficient of determination, r² (for a two-variable model; R² for a multiple regression
model), tells us how well the sample regression line fits the data.
In a two-variable model the coefficient of determination, r², tells us the proportion of the variation in the
dependent variable that is explained by variation in the independent variable.
The coefficient of determination is obtained by squaring the Pearson product moment correlation
coefficient, i.e. r² = (rxy)².

Question. Using the data in Example (M), compute r² and interpret.

Solution:
Recall: rxy = 0.9792, so r² = (0.9792)² = 0.9588 ≈ 96%.

Interpretation: Approximately 96% of the changes in quantity supplied (Y) are due to changes
in price (X), or approximately 96% of the changes in Y are explained by changes in X.

Properties of r²
• It is non-negative.
• It falls between 0 and 1 inclusive, i.e. 0 ≤ r² ≤ 1.
Limitations of r² (R-squared)
• r² cannot determine whether the coefficient estimates and predictions are biased.
• r² does not indicate whether a regression model is adequate. You can have a low r² value
for a good model or a high r² value for a model that does not fit the data.
Testing goodness of fit with R-squared
• R-squared is used to judge the explanatory power of the linear regression of the
dependent variable Y on the independent variable, say X. It measures the dispersion of
the observations around the regression line.
The closer the observations are to the line, the better the variation in Y is explained by
changes in the explanatory variables.
Put differently, R-squared shows the percentage of the total variation of the dependent
variable that can be explained by the independent variable. For decision purposes, a good
R-squared should be greater than or equal to 0.5, that is, it should explain at least 50% of the total
fluctuations in the dependent variable.
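In practice R-squared is often computed from the fitted values as 1 minus the ratio of the residual sum of squares to the total sum of squares; in the two-variable model with an intercept this is equivalent to squaring the correlation coefficient. A minimal sketch (the y values and fitted values below are made up for illustration):

# R-squared as the share of total variation explained by the regression:
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
def r_squared(y, y_fitted):
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y        = [2.1, 4.3, 5.9, 8.2, 9.8]   # illustrative observations
y_fitted = [2.3, 4.2, 6.1, 8.0, 9.9]   # illustrative fitted values
print(r_squared(y, y_fitted))          # close to 1, i.e. the line fits the data well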

Qn. Are low R-squared values essentially bad?


Answer
No. There are two major reasons why it can be just fine to have low R-squared values:
(i) In some fields, it is entirely expected that your R-squared values will be low. For
example, any field that attempts to predict human behavior, such as psychology,
typically has R-squared values lower than 50%. Humans are simply harder to predict
than, say, physical processes.
(ii) If your R-squared value is low but you have statistically significant predictors, you
can still draw important conclusions about how changes in the predictor values are
associated with changes in the response value.
Regardless of R-squared, the significant coefficients still represent the mean change
in the response per unit change in the predictor, holding the other predictors in
the model constant.
Qn. Are high R-squared values intrinsically good?
Answer
No. A high R-squared does not necessarily indicate that a model has a good fit.
Assumptions of the classical linear regression model under normality
• The mean of the error term is zero, i.e. E(uᵢ) = 0.
• The variance of the error term is constant, i.e. Var(uᵢ) = σ².
• The error terms are uncorrelated, i.e. Cov(uᵢ, uⱼ) = 0 for all i ≠ j.
• The error terms are not only uncorrelated but also normally, identically and independently
distributed with zero mean and constant variance, i.e. uᵢ ~ NID(0, σ²).
Properties of ordinary least squares (OLS) estimators under normality
• They are linear.
• They are unbiased, i.e. E(β̂) = β.
• They are efficient (they have the least variance of all other estimators).
• They are consistent.
• α̂ and β̂ are independent.
Simple linear regression model
A simple linear regression model is of the form: Ŷ = α̂ + β̂X
Interpretation of a simple linear regression model
• α̂ is the average value of Y when X = 0.
• β̂: when X increases by one unit, on average, Y will increase by approximately β̂ units.

Note: Assuming we had Ŷ = α̂ − β̂X, we would interpret β̂ as follows: “when X increases by
one unit, on average Y reduces by approximately β̂ units.’’
Example:
At Kabale coffee factory, the salary (Y) paid in Uganda Shillings depends on the number of days
worked for in a given month (X) and is determined by the model:
Ŷ = 1000 + 5.7273Xᵢ
s.e.: (5.210) (0.4198)
t: (-0.7677) (13.6416), r² = 0.9588, n = 10.
(a) Interpret the model.
• An employee who has not worked for any day will, on average, earn a salary of
approximately 1000 Shs.
• When the number of days worked increases by one day, on average the salary will increase by
approximately 5.7273 Shs. This is statistically significant since t = 13.6416 is greater than 2.
• r² = 0.9588 = (0.9588 x 100)% = 95.88% ≈ 95.9%.
95.9% of the changes in the salary are due to changes in the number of days worked for.
Or: changes in the number of days worked for explain 95.9% of the changes in one’s salary.
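The significance rule used above (|t| = coefficient / standard error, compared with roughly 2) and a prediction from the fitted line can be checked with a few lines of Python; the number of days worked below is an illustrative value, not one from the notes:

# Slope significance and a prediction from the salary model above
intercept, slope = 1000, 5.7273
se_slope = 0.4198

t_slope = slope / se_slope                 # ~13.64; |t| > 2, so the slope is significant
days_worked = 22                           # illustrative number of days worked in the month
predicted_salary = intercept + slope * days_worked
print(round(t_slope, 4), predicted_salary)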
MULTIPLE REGRESSION ANALYSIS
In multiple linear regression models, the dependent variable Y is a linear function of k
independent variables. Consider the multiple linear model below:
Y = B0 + B1X1 + B2X2 + … + BkXk + U
Where;
Y = dependent variable
X1, X2, …, Xk are the independent variables
U = the error term
B0 = constant term
B1, B2, …, Bk are the slope coefficients.

Assumptions of multiple linear regression

• The relationship between the independent variables (X) and the dependent variable (Y) is linear.
• The residuals or errors are independent, with no presence of autocorrelation.
• The residuals follow a normal probability distribution.
• There should be no multicollinearity (the independent variables should not be
correlated with one another).

Note: Multicollinearity is whereby the independent variables are correlated with one another.
• The error term should be homoscedastic (the variance of the errors is the same for all
values of the independent variables).

Example;
The salary (Y) of employees in a certain factory depends on one’s age (X1), number of children ever
born (X2), years of experience (X3) and distance from home to the factory (X4),
and is as given in the model below:

Y = 50,000 + 3.2X1 – 2.4X2 + 3X3 – 2.2X4


(a) Interpret the model.
(b) Find the salary of an employee who is aged 40 years, has 3 children, has worked for 10
years and his distance from home to the factory is 3km.
Solution:
(a) When we do not consider one’s age, number of children ever born, years of experience
and distance from home to the factory, on average, an employee would earn
approximately shs 50,000.

X1: as age increases by one year, the employee’s salary will increase by approximately
3.2 shillings, ceteris paribus.

X2: an increase in the number of children ever born by one child decreases one’s salary
by approximately 2.4 shillings, ceteris paribus.

X3: when the years of experience increase by one year, the average salary of an employee
increases by approximately 3 shillings, ceteris paribus.

X4: an increase in the distance from home to the factory by 1 km reduces the employee’s
salary by approximately 2.2 shillings, ceteris paribus.

(b) When X1 = 40 years X2 = 3, X3 = 10 and X4 = 3km


Y= 50,000 + 3.2(40) – 2.4(3) + 3(10) – 2.2 (3).
Y = 50,144.2.
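The same calculation, written as a small helper function using the coefficients of the model above:

# Predicted salary from the multiple regression model above
def predict_salary(age, children, experience, distance_km):
    return 50_000 + 3.2 * age - 2.4 * children + 3 * experience - 2.2 * distance_km

print(predict_salary(40, 3, 10, 3))   # 50144.2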

The Cobb – Douglas production function.


With appropriate transformations we can convert nonlinear relationships into linear ones so that
we can work within the framework of the classical linear regression model. The various
transformations discussed there in the context of the two-variable case can be easily extended to
multiple regression models. The specific example we discuss is the celebrated Cobb–Douglas
production function of production theory.
The Cobb-Douglas production function in its stochastic form may be expressed as:

Yᵢ = β₁ X₂ᵢ^β₂ X₃ᵢ^β₃ e^uᵢ ……………………………………..(*)

Where;
Y = output
X₂ = labour input
X₃ = capital input
u = stochastic disturbance term (error term)
e = base of the natural logarithm.

It is clear that the relationship between output and the two inputs is nonlinear. However, if we
log-transform this model, we obtain:

ln Yᵢ = ln β₁ + β₂ ln X₂ᵢ + β₃ ln X₃ᵢ + uᵢ …………………………………………………….(1)

Letting β₀ = ln β₁,

ln Yᵢ = β₀ + β₂ ln X₂ᵢ + β₃ ln X₃ᵢ + uᵢ ………………………………………….(2)

Thus written, the model in Equation (2) is nonlinear in the variables Y and X but linear in the logs of
these variables. Therefore, Equation (2) is a log-log, double-log, or log-linear model.
The properties of the Cobb-Douglas production function are quite well known.
1. β₂ is the (partial) elasticity of output with respect to the labor input, that is, it measures the
percentage change in output for, say, a 1 percent change in the labor input, holding the capital
input constant.
2. Likewise, β₃ is the (partial) elasticity of output with respect to the capital input, holding the
labor input constant.
3. The sum (β₂ + β₃) gives information about the returns to scale, that is, the response of output
to a proportionate change in the inputs.
i) If β₂ + β₃ = 1, then there are constant returns to scale, that is, doubling the inputs will
double the output, tripling the inputs will triple the output, and so on.
ii) If β₂ + β₃ < 1, there are decreasing returns to scale, i.e. doubling the inputs will less than
double the output.
iii) If β₂ + β₃ > 1, there are increasing returns to scale, i.e. doubling the inputs will
more than double the output.
Note: whenever you have a log-linear regression model involving any number of variables, the
coefficient of each of the X variables measures the (partial) elasticity of the dependent variable Y
with respect to that variable. Thus, if you have a k-variable log-linear model:

ln Yᵢ = β₀ + β₂ ln X₂ᵢ + β₃ ln X₃ᵢ + … + βₖ ln Xₖᵢ + uᵢ ………………….(3)

Each of the (partial) regression coefficients, β₂ through βₖ, is the (partial) elasticity of Y with
respect to the variables X₂ through Xₖ.
Example: extracted as it is from Basic Econometrics by Damodar N. Gujarati
To illustrate the Cobb-Douglas production function, we obtained the data shown in Table 7.3;
these data are for the agricultural sector of Taiwan for 1958-1972.
The output elasticities of labor and capital were 1.4988 and 0.4899, respectively. We can interpret
this as follows:

Interpretation:

• β₂ = 1.4988
Over the period of study, holding the capital input constant, a 1% increase in the labor input led
on average to about a 1.5% increase in output.
This is statistically significant (|t| > 2).

• β₃ = 0.4899
Holding the labor input constant, a 1% increase in the capital input led on average to
about a 0.5% increase in output.
This is statistically significant (|t| > 2).

Returns to scale: β₂ + β₃ = 1.4988 + 0.4899 = 1.9887 > 1, so there were increasing returns to scale.
Therefore, over the period of the study, the Taiwanese agricultural sector was characterized by
increasing returns to scale.

R²
The estimated regression line fits the data quite well. The R² value of 0.8890 means that about
89% of the variation in the (log of) output is explained by the (logs of) labor and capital.
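A sketch of how such a double-log model can be estimated by OLS in Python using statsmodels; the output, labour and capital figures below are made up for illustration and are not the Taiwan data of Table 7.3:

import numpy as np
import statsmodels.api as sm

# Illustrative (made-up) data: output Y, labour L, capital K
Y = np.array([50.0, 60.0, 72.0, 85.0, 100.0, 118.0])
L = np.array([10.0, 11.0, 12.5, 14.0, 16.0, 18.0])
K = np.array([20.0, 23.0, 26.0, 30.0, 34.0, 39.0])

# Log-log (double-log) specification: ln Y = b0 + b2 ln L + b3 ln K + u
X = sm.add_constant(np.column_stack([np.log(L), np.log(K)]))
model = sm.OLS(np.log(Y), X).fit()

b0, b2, b3 = model.params
print("elasticity of labour:", b2)
print("elasticity of capital:", b3)
print("returns to scale (b2 + b3):", b2 + b3)   # > 1 increasing, = 1 constant, < 1 decreasing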
