Misgana D.(Bsc,MSc)
The classical simple regression model
• After completing this unit the student will, among
others,
– be able to differentiate regression analysis from
correlation analysis
– be able to apply the ordinary least squares (OLS)
method in a two-variable regression analysis and
interpret the results.
– conduct a measure of goodness of fit of regression
estimates.
– construct hypothesis testing procedure for regression
coefficients
– apply the regression result to forecasting (prediction)
2.1 REGRESSION Vs CORRELATION
• Correlation Analysis: Measures the strength/degree and direction of
linear association between two variables (both are assumed to be
random)
• r = (nΣXY − ΣXΣY) / √[(nΣX² − (ΣX)²) · (nΣY² − (ΣY)²)]
• Pearson's Correlation Coefficient (r) is a number between −1 and 1
which measures the degree to which two variables are linearly related.
• It also tells you three things about the relationship:
1. Strength 2. Direction 3. Significance
• How strong is the relationship? How big is the number?
1.0 (-1.0) = Perfect Correlation
0.60 to 0.99 (-0.60 to -0.99) = Strong
0.30 to 0.59 (-0.30 to -0.59) = Moderate
0.01 to 0.29 (-0.01 to -0.29) = Weak
0 = No Correlation
• When P-value is below 0.05, the correlation is statistically significant.
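The computational formula above can be checked with a minimal pure-Python sketch, using the small household data set that appears in the worked example later in the chapter:

```python
import math

def pearson_r(x, y):
    """Pearson r via the computational formula:
    r = (n*SumXY - SumX*SumY) / sqrt((n*SumX^2 - (SumX)^2)(n*SumY^2 - (SumY)^2))"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Household data from the worked example later in the chapter:
X = [30, 50, 60]    # disposable income (thousands of birr)
Y = [10, 20, 30]    # consumption (thousands of birr)
r = pearson_r(X, Y)  # ≈ 0.982: a strong, positive, linear association
```

By the strength scale above, r ≈ 0.982 falls in the "strong" (0.60 to 0.99) band.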
Cont'd...
• An alternative rule of thumb for interpreting the correlation value:
• If correlation is < 0.3: Weak correlation
• If correlation is between 0.3 and 0.7: Moderate correlation
• If correlation is > 0.7: Strong correlation
• Regression Analysis: The process of estimating or predicting the
average value of one variable, the dependent variable (assumed to be
stochastic), on the basis of other, independent variables (assumed to be
non-stochastic).
• Regression is an equation that allows us to express the relationship
between two or more variables algebraically.
• In regression the independent variables help us to predict the values of the
dependent variable.
• Correlation shows only the magnitude and direction of a relationship;
prediction of one variable based on the other is not possible.
• The simple linear regression analysis helps you to find out the
relationship between two variables, its strength and direction.
The relationship is explained by the Pearson correlation coefficient (r).
The strength is explained by the coefficient of determination, R² = r².
Finally, the direction of the relationship is given by the sign of the "b" coefficient.
2.2 Simple Linear Regression Model
• The simple linear regression model (two-variable model) is the
single most useful tool in the econometrician's kit.
• The model has one input and one output variable only.
• It is the most elementary type of regression model that can
be expressed by the following equation:
Yi = a + bXi + ui
where
i = index of the sample observations (i = 1, 2, 3, ..., n)
Yi = dependent variable
Xi = explanatory (independent) variable
a = constant (intercept) value and
b = slope of the relationship
ui = disturbance term or error term
With simple regression analysis, we can predict the future value
based on historical data.
Dependent Variable Y; Explanatory Variable Xs
1. Y = Son’s Height; X = Father’s Height
2. Y = Height of boys; X = Age of boys
3. Y = Personal Consumption Expenditure
X = Personal Disposable Income
4. Y = Demand; X = Price
5. Y = Rate of Change of Wages
X = Unemployment Rate
6. Y = Money/Income; X = Inflation Rate
7. Y = % Change in Demand; X = % Change in the
advertising budget
8. Y = Crop yield; Xs = temperature, rainfall, sunshine,
fertilizer
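For pairs like those listed above, the two-variable model can be estimated with a minimal OLS sketch. The income and consumption figures below are hypothetical, chosen only for illustration:

```python
def ols_fit(x, y):
    """OLS estimates for the two-variable model Yi = a + b*Xi + ui:
    b = Sum((Xi - Xbar)(Yi - Ybar)) / Sum((Xi - Xbar)^2),  a = Ybar - b*Xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Hypothetical data: X = personal disposable income, Y = consumption expenditure
X = [10, 20, 30, 40, 50]
Y = [8, 14, 22, 26, 35]
a, b = ols_fit(X, Y)   # a = 1.2, b = 0.66
```

With these made-up figures, each extra unit of income raises predicted consumption by b = 0.66 units.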
Terminology and Notation
Dependent Variable | Independent Variable(s)
Explained Variable | Explanatory Variable(s)
Predictand | Predictor(s)
Regressand | Regressor(s)
Response | Stimulus or control variable(s)
Endogenous | Exogenous
Example of Simple linear regression model…
Given: Salary = a + b Edu + Ui
where
o Salary is measured in birr per year
o Edu is measured in years of schooling
Q1. What factors are included in the error term (Ui)?
Answer: Work experience, age, gender, marital status
(married or single), race (white or non-white), etc.
Two-variable Regression Model…
o The observed value of Yi is the sum of two parts: the fitted value Ŷi and the residual ûi.
SRF: Yi = α̂ + β̂Xi + ûi
• Estimation of α and β by OLS involves
finding values for the estimates α̂ and β̂ which
will minimize the sum of the squared
residuals (Σûi²).
Cont'd……
Estimation: Deriving the OLS estimates…
• β̂1 = (nΣXiYi − ΣXiΣYi) / (nΣXi² − (ΣXi)²)
= [3(3100) − (140)(60)] / [3(7000) − (140)²] = 900/1400 = 0.64
• β̂0 = Ȳ − β̂1X̄ = 20 − 0.64(140/3) = 20 − 29.87 = −9.87
Therefore, the fitted regression model is given by: Ŷi = −9.87 + 0.64Xi
b) Interpretation
The value of the intercept term, β̂0 = −9.87 (thousands), means that when the HHs'
disposable income is zero, HH consumption expenditure is −9.87, i.e. roughly ten
thousand birr of dissaving.
The value of the slope coefficient (β̂1 = 0.64) means that when HH disposable
income increases by 1 birr, consumption increases by 0.64 birr (64 cents).
c) At X = 45 thousand birr: Ŷi = −9.87 + 0.64Xi = −9.87 + (0.64)(45) = 18.93
thousand birr = 18,930 birr.
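Carrying full precision, β̂0 comes out as exactly −10; the −9.87 above reflects rounding β̂1 to 0.64 before computing the intercept. Either way the forecast at X = 45 rounds to about 18.93 thousand birr. A quick check in Python:

```python
# Data from the worked example: X = disposable income, Y = consumption (thousands)
X = [30, 50, 60]
Y = [10, 20, 30]
n = len(X)
sxy = sum(x * y for x, y in zip(X, Y))           # 3100
sx, sy = sum(X), sum(Y)                          # 140, 60
sxx = sum(x * x for x in X)                      # 7000
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # 900/1400 ≈ 0.6429
b0 = sy / n - b1 * (sx / n)                      # = -10.0 at full precision
yhat_45 = b0 + b1 * 45                           # ≈ 18.93 thousand birr
```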
Example for Derived formula
Q2. Find the Regression equation for the data under example 3.1,
using the shortcut formula.
To solve this problem we proceed as follows.
Y | X | yi = Y − Ȳ | xi = X − X̄ | xiyi | xi²
10 | 30 | −10 | −16.67 | 166.67 | 277.78
20 | 50 | 0 | 3.33 | 0.00 | 11.11
30 | 60 | 10 | 13.33 | 133.33 | 177.78
Sum | 60 | 140 | 0 | 0 | 300 | 466.67
Mean | 20 | 46.67
β̂1 = Σxiyi / Σxi² = 300 / 466.67 = 0.64
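The deviation (Method 2) calculation in the table can be reproduced with a minimal sketch:

```python
# Example 3.1 data: deviation-form (Method 2) slope estimate
X = [30, 50, 60]
Y = [10, 20, 30]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n            # 46.67, 20
xd = [x - xbar for x in X]                     # xi = X - Xbar
yd = [y - ybar for y in Y]                     # yi = Y - Ybar
b1 = sum(a * b for a, b in zip(xd, yd)) / \
     sum(a * a for a in xd)                    # 300/466.67 ≈ 0.64
```

This matches the slope obtained from the raw-sums (shortcut) formula, as it must.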
Sales volume 58 105 88 188 117 137 157 169 149 202
a) Develop the regression model, compute the values of the
parameters α̂ and β̂, and interpret them.
b) If the level of advertisement cost be 27 thousand birr, what will
be the predicted sales volume?
c) Compute the Pearson correlation coefficient r and coefficient
of determination (R2) and interpret their results.
d) Use the deviation formula (Method 2) and calculate the value
of error term in the model.
2. Suppose you examined the effect of education on salary and
formulated the econometric model as follows.
Salary = 20 + 2.1edu + e
What kinds of variables does "e" in the above model represent?
Individual Assignment(10%)
1. Suppose a manager has been spending money year after year on
advertisement to promote the sales of his firm’s product. The annual sales figures
are in thousands of birr and ad-expenditure is in millions of birr, as presented
below.
Ad-Exp : 5 8 10 12 10 15 18 20 21 25
Sales : 45 50 55 58 58 72 70 85 72 85
Required: By OLS method,
a) Develop the simple Regression model and interpret the value of a and b
b) If the manager decides to spend 28 million birr in year 2000, predict the
approximate sales volume for that year.
c) Compute the Pearson correlation coefficient r and coefficient of determination
(R2) and interpret their results.
d) Use the deviation formula (Method 2) and calculate the value of error term in
the model.
2. Suppose you examined the effect of education on salary and formulated the
econometric model as follows. Salary = 20 + 2.1edu + e
What kinds of variables does "e" in the above model represent?
2.6 Assumptions of the CLRM
o The classical linear regression model (CLRM) rests
on a set of assumptions (commonly known as the
Gauss–Markov assumptions) that describe:
o the form of the model and the relationships among its parts,
and the appropriate estimation and inference procedures.
A1:The regression model is linear in parameter.
i.e. it may or may not be linear in Y and X.
Ex: Which of the following models satisfy the assumption?
a) Yi = α + βXi + Ui
b) lnYi = α + β lnXi + Ui
c) Yi = α + βXi² + Ui
d) Yi = α + β²Xi + Ui
(Models a, b and c are linear in the parameters; model d is not, since the parameter enters as β².)
A2: The mean value of the error term Ui is zero.
E(Ui)=0
Assumptions ….
A3: The variance of the error term Ui is constant, called
homoscedasticity, i.e. Var(Ui) = σ².
A violation of this assumption, widely known as
heteroskedasticity (non-constant variance), leaves the estimates
unbiased but makes the usual standard errors unreliable,
which may lead to invalid (often wider) confidence intervals.
A4: There is no correlation between any two error terms,
i.e. Cov(Ui, Uj) = 0 for i ≠ j.
If instead Cov(Ui, Uj) ≠ 0 for i ≠ j, there is an autocorrelation
problem, which occurs where successive disturbance terms are
associated with each other.
This leads to unreliable standard errors and hypothesis-testing
problems; the F-value could be meaningless.
Assumptions ….
A5: There is no perfect or exact linear relationship among the
X-variables, i.e. no perfect multicollinearity.
Low correlations among the regressors, however, do not cause
any problem for the parameter estimates.
A violation of this independence assumption indicates a
multicollinearity problem among the explanatory variables,
which can produce a very high coefficient of determination
together with imprecise (high-variance) parameter estimates.
A6: The error term Ui follows the normal distribution,
i.e. Ui ~ N(0, σ²).
A violation of this assumption often occurs when there are
outliers in the data set, and leads to unreliable confidence
intervals and invalid small-sample hypothesis tests.
Assumptions ….
A7: Non-endogeneity: none of the independent variables
should be correlated with the error term,
that is, Cov(Xi, ui) = 0.
A departure from this assumption, known as the
endogeneity problem, occurs for example when relevant
variables are omitted, or when lagged dependent variable(s)
are introduced as independent variable(s) in a model. This
leads to biased and inconsistent parameter estimates.
Properties of OLS estimators
If assumptions 1–6 hold true, α̂ and β̂ determined by OLS
are BLUE.
What does BLUE stand for?
B = Best
L = Linear
U = Unbiased
E = Estimator
An estimator is called BLUE if:
A. Linear: Estimators are a linear function of the dependent
variable Y.
B. Unbiased: on average, the estimators equal the true
population parameters, i.e. E(β̂) = β.
C. Best: OLS estimators have minimum variance under the
class of linear and unbiased estimators.
D. Efficient: An unbiased estimator with the least variance is
known as an efficient estimator.
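Unbiasedness can be illustrated with a small Monte Carlo sketch. All numbers below are simulated and chosen only for illustration: errors are drawn to satisfy A2 (zero mean), A3 (constant variance), A4 (independence) and A6 (normality), and the slope is re-estimated many times.

```python
import random

def ols_slope(x, y):
    """Two-variable OLS slope estimate."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
           sum((a - xbar) ** 2 for a in x)

random.seed(0)
true_b = 0.64                  # assumed true slope for the simulation
x = list(range(1, 31))
estimates = []
for _ in range(2000):
    # ui ~ N(0, 2^2), independent across observations
    y = [-10.0 + true_b * xi + random.gauss(0, 2) for xi in x]
    estimates.append(ols_slope(x, y))
mean_b = sum(estimates) / len(estimates)   # settles close to 0.64: unbiasedness
```

Individual estimates scatter around 0.64, but their average across repeated samples is very close to the true value.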
2.7 Model Validity Test
• How do you test whether the fit (or estimates) is good?
• The adequacy or validity of a regression model is
checked using:
1) Coefficient of determination (R2) as Goodness of fit
2) ANOVA-Test (or F-statistic test ) as over all significance
3) T-statistic test as individual coeff. significance test
F-statistic is an overall test of the explanatory (or independent)
variables, while t-statistic is a test of significance for each (or
individual) explanatory variable, including the slope coefficient
and the constant term in a model.
a) R²-Test of the 'Goodness of fit'
This method determines whether a regression model is valid or
adequately fit the data under investigation.
Now the total variation in Y, called TSS (Total Sum of Squares),
decomposes into RSS (Regression Sum of Squares) and ESS
(Error Sum of Squares): TSS = RSS + ESS.
• Mathematically, it is formulated as:
1. R² = RSS/TSS = β̂²Σxi² / Σyi²
2. R² = 1 − ESS/TSS = 1 − Σei² / Σyi²
(where xi = Xi − X̄ and yi = Yi − Ȳ)
3. R² = r², where r is the correlation coefficient
• The coefficient of determination ranges between 0 and 1
inclusive, while correlation coefficient ranges between -1
and 1 inclusive.
–When R² = 1 → the model fits perfectly, i.e. ESS = 0
–When R² = 0 → the model explains nothing, i.e. RSS = 0
The domain of R2
• The largest value that R2 can assume is 1
(in which case all observations fall on the regression line,
which does not happen in empirical work)
R2 Closer to one indicates that the model is strong.
A low value of R² (closer to zero) indicates that:
X is a poor explanatory variable, in the sense that variation in
X leaves Y unaffected, or
while X is a relevant variable, its influence on Y is weak as
compared to some other variables that are omitted from the
regression equation, or
the regression equation is mis-specified (for example, an
exponential relationship might be more appropriate).
Interpretation of R2
R2 measures the percentage of total variation of the
dependent variable that can be explained by the
changes in the explanatory variable(s) included in the
model.
What do we mean when R2 =0.9?
About 90% of the variation in the dependent variable Y is
explained by the regression line, while the remaining 10% of the
variation in Y is due to other factors captured in the error term.
Notice:
The proportion of total variation in the dependent
variable (Y) that is explained by X or by the regression
line is equal to: R2 x100%.
The proportion of total variation in the dependent
variable (Y) that is not explained by X or due to
factors other than X is equal to: (1– R2) x 100%.
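These proportions can be verified numerically. The sketch below refits the household example and checks that R² = RSS/TSS = 1 − ESS/TSS:

```python
# Household example: X = disposable income, Y = consumption (thousands of birr)
X = [30, 50, 60]
Y = [10, 20, 30]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in X]

tss = sum((y - ybar) ** 2 for y in Y)               # total sum of squares
rss = sum((yh - ybar) ** 2 for yh in yhat)          # regression (explained) SS
ess = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))  # error (residual) SS
r2 = rss / tss                                      # ≈ 0.9643; also 1 - ess/tss
```

Here about 96.4% of the variation in consumption is explained by income, and TSS = RSS + ESS holds exactly, as the decomposition requires.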
B) The Analysis of Variance (ANOVA)-F test
• A small value of R2 casts doubt about the usefulness of the
regression equation. We do not, however, pass final judgment on
the equation until it has been subjected to an objective statistical
test.
• Such a test is accomplished by means of analysis of variance
(ANOVA) which enables us to test the significance of R2 (i.e., the
adequacy of the linear regression model).
To test the significance of R², we compare the computed variance
ratio with the critical value from the F distribution with k − 1
numerator and n − 2 denominator degrees of freedom, for a given
significance level α.
Decision: If the calculated variance ratio exceeds the tabulated
value, that is, if Fcal > Fα (k-1,n-2), we conclude that R2 is
significant (or that the linear regression model is adequate).
The F-test is designed to test the significance of all variables in a
regression model. In the two-variable model, however, it is used to
test the explanatory power of a single variable (X), and at the same
time, is equivalent to the test of significance of R2.
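Because of this equivalence, the F-statistic can be computed directly from R². A minimal sketch, taking R² ≈ 0.498 and n = 16 from the self-test exercise later in this section:

```python
def f_from_r2(r2, n, k):
    """Overall F-statistic from R-squared:
    F = (R^2 / (k - 1)) / ((1 - R^2) / (n - k))."""
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# Two-variable model: k = 2 parameters, n = 16 observations, R^2 ≈ 0.498
F = f_from_r2(0.498, 16, 2)   # ≈ 13.9, well above F(0.05; 1, 14) = 4.60
```

Since the computed F exceeds the tabulated critical value, R² is significant and the model is judged adequate.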
(ANOVA)-Table
• The ANOVA table for simple linear regression is given below:
Source of variation | Sum of Squares | df | Mean Square | F-ratio
Regression | RSS | k − 1 | MSR = RSS/(k − 1) | MSR/MSE
Residual (Error) | ESS | n − k | MSE = ESS/(n − k) |
Total | TSS | n − 1 | |
Analysis of Variance (ANOVA)-F test….
• F-value measures the variance ratio of two independent
mean square estimates: the regression (explained) mean
square and the residual mean square.
• F = [RSS/(k − 1)] / [ESS/(n − k)]
• The ratio of these two values is known as the F-value, which
assesses the significance of the overall effect of the
variables involved in the regression model, thereby testing
whether the model adequately represents or explains the
data.
where n = number of observations and
k = number of parameters estimated
Self test Exercise
Consider the following data on the percentage rate of change in
electricity consumption (millions KWH) (Y) and the rate of
change in the price of electricity (Birr/KWH) (X) for the years
1979 – 1994.
Summarized data:
• n = 16, X̄ = 1.281, Ȳ = 23.427, Σxi² = 92.201
• Σyi² = 13228.70, Σxiyi = −779.235
where xi = Xi − X̄ and yi = Yi − Ȳ
Required
a. Estimate the regression coefficient β̂ and the intercept α̂.
b. Test model adequacy using R². Is the regression model
adequate and useful for prediction? Why? Justify your answer.
c. Test the overall significance of the estimated regression line using
F-test. What can you conclude?
Solution
a) Estimation of regression coefficients
• The slope β̂ and the intercept α̂ are computed as:
• β̂ = Σxiyi / Σxi² = −779.235 / 92.201 = −8.451
• α̂ = Ȳ − β̂X̄ = 23.427 − (−8.451)(1.281) = 34.25
b) Test of model adequacy using R²:
TSS = Σyi² = 13228.7
RSS = β̂²Σxi² = (−8.451)²(92.201) = 6585.679
R² = RSS/TSS = 6585.679/13228.7 ≈ 0.498, so about 49.8% of the
variation in Y is explained by X: moderate explanatory power.
c) F = [RSS/(k − 1)] / [ESS/(n − k)] = 6585.679 / [(13228.7 − 6585.679)/14]
≈ 13.88 > F0.05(1, 14) = 4.60, so the regression is significant overall.
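The whole self-test solution can be verified in a few lines of Python from the summarized sums:

```python
# Summarized data from the self-test exercise (deviation form)
n = 16
xbar, ybar = 1.281, 23.427
sxy, sxx, syy = -779.235, 92.201, 13228.70   # Sum(xiyi), Sum(xi^2), Sum(yi^2)

b1 = sxy / sxx                     # slope ≈ -8.451
b0 = ybar - b1 * xbar              # intercept ≈ 34.25
rss = b1 ** 2 * sxx                # regression SS ≈ 6585.7
r2 = rss / syy                     # ≈ 0.498: X explains about half of Y
F = rss / ((syy - rss) / (n - 2))  # ≈ 13.88 > F(0.05; 1, 14) = 4.60
```

The negative slope says a one-unit rise in the rate of change of the electricity price lowers the rate of change of consumption by about 8.45 units.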
SE(β̂) = σ̂ · √(n / (nΣXi² − (ΣXi)²)) = 0.0079
Solution …
c) Ŷ = −59.12 + 0.35X
(Se) (35.0) (0.0079)
t ? ?
Required: Calculate the t-values for α̂ and β̂ from the above model.
Solution:
We can calculate the t-values for α̂ and β̂ as follows.
• tcal for α̂ = α̂ / se(α̂) = −59.12/35.0 = −1.69
• tcal for β̂ = β̂ / se(β̂) = 0.35/0.0079 = 44.3
The Model summary report:
c) Ŷ = −59.12 + 0.35X
SE (35.0) (0.0079)
t (−1.69) (44.3)
Case 3: An individual Significance t-test Approach
The interest of an econometrician is not only in obtaining the
estimators α̂ and β̂ but also in using them to make inferences
about the true parameters α and β.
• For this purpose, the error term is assumed to be normally
distributed; hence the estimators, being linear functions of the
error term, are also normally distributed.
The theory of estimation consists of two parts: point estimation
and interval estimation.
Instead of relying on the point estimate alone, we may
construct an interval around the point estimator that has, say, a 95
percent probability of including the true parameter value; this is
called interval estimation, i.e. β lies in β̂1 ± tα/2 · se(β̂1).
NB: In statistics the reliability of point estimator is measured by its
standard error (SE).
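As a sketch of interval estimation, take the consumption model summary above and assume n = 20 (as in the example that follows), so df = n − 2 = 18 and t0.025 ≈ 2.101 from standard t tables:

```python
# 95% interval estimate for the slope: b1 ± t(alpha/2, n-2) * se(b1)
b1, se_b1 = 0.35, 0.0079   # slope and its standard error from the model summary
n = 20
t_crit = 2.101             # t(0.025, df = 18), from standard t tables
lo = b1 - t_crit * se_b1   # ≈ 0.333
hi = b1 + t_crit * se_b1   # ≈ 0.367
```

The interval (≈ 0.333, 0.367) excludes zero, which agrees with the very large t-value (44.3) computed for the slope.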
Example 1
Suppose that we have the following regression results from
a sample size n=20 of consumption function.