
CFA® Level II – Quantitative Methods

Multiple Regression and Issues in Regression Analysis

www.irfanullah.co
Graphs, charts, tables, examples, and figures are copyright 2012, CFA Institute. Reproduced
and republished with permission from CFA Institute. All rights reserved.

Contents and Introduction
1. Introduction

2. Multiple Linear Regression

3. Using Dummy Variables in Regressions

4. Violations of Regression Assumptions

5. Model Specification and Errors in Specification

6. Models with Qualitative Dependent Variables

2. Multiple Linear Regression

Multiple linear regression allows us to determine the effect of more than one
independent variable on a particular dependent variable

Yi = b0 + b1X1i + b2X2i + εi
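As a rough illustration (not from the reading), here is a minimal Python sketch of estimating this two-variable regression on simulated data; the statsmodels library and all variable names are assumptions for the example:

```python
# Minimal sketch: estimate Yi = b0 + b1*X1i + b2*X2i + eps with OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * X1 - 0.5 * X2 + eps              # true model: b0=1, b1=2, b2=-0.5

X = sm.add_constant(np.column_stack([X1, X2]))   # prepend the intercept column
model = sm.OLS(y, X).fit()
print(model.params)      # estimates of b0, b1, b2
print(model.summary())   # full ANOVA-style regression output
```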

Example 1: Explaining the Bid-Ask Spread

A log-log regression model may be appropriate when one believes that proportional changes in the dependent variable bear a constant relationship to proportional changes in the independent variable(s).
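For instance, here is a hedged sketch of a log-log specification on simulated (not actual bid-ask) data; in this form the slope estimates an elasticity, the proportional change in Y per proportional change in X:

```python
# Sketch of a log-log regression: ln(spread) on ln(volume), simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
volume = rng.lognormal(mean=10, sigma=1, size=200)
spread = 50 * volume ** -0.4 * rng.lognormal(sigma=0.2, size=200)  # true elasticity -0.4

X = sm.add_constant(np.log(volume))
fit = sm.OLS(np.log(spread), X).fit()
print(fit.params[1])   # slope = elasticity estimate, close to -0.4
```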

Example 1: Evaluating the Regression Output

Read Examples 2 and 3


2.1 Assumptions of the Multiple Linear Regression Model
To draw valid inferences from a multiple linear regression model, we need to make the following six assumptions:

1. The relationship between the dependent variable Y and the independent variables X1, X2, …, Xk is linear.
2. The independent variables (X1, X2, …, Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent variables, is 0: E(ε | X1, X2, …, Xk) = 0.
4. The variance of the error term is the same for all observations: E(εi²) = σε².
5. The error term is uncorrelated across observations: E(εi εj) = 0 for all j ≠ i.
6. The error term is normally distributed.

2.2 Predicting the Dependent Variable in a Multiple Regression Model
To predict the value of a dependent variable using a multiple linear regression model, we follow these three steps:

1. Obtain estimates of the regression parameters b0, b1, …, bk.
2. Determine the assumed values of the independent variables X1, X2, …, Xk.
3. Compute the predicted value of Y by substituting the assumed values into the estimated regression equation.

Read Example 4
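Continuing the simulated sketch from Section 2, the prediction step in code (the assumed values 0.5 and -1.0 of the independent variables are purely illustrative):

```python
# Predict Y for assumed values of the independent variables,
# reusing the fitted `model` from the earlier sketch.
import numpy as np

x_new = np.array([1.0, 0.5, -1.0])   # [intercept, X1, X2]
y_hat = model.params @ x_new         # step 3: b0 + b1*X1 + b2*X2
print(y_hat)                         # model.predict(x_new) gives the same
```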
2.3 Testing Whether All Population Regression Coefficients Equal Zero
To test the null hypothesis that all of the slope coefficients in the multiple regression model are jointly equal to 0 (H0: b1 = b2 = … = bk = 0) against the alternative hypothesis that at least one slope coefficient is not equal to 0, we use an F-test.
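Reusing `model` from the simulated sketch in Section 2 (n = 100 observations, k = 2 independent variables), here is where the F-stat comes from; the attribute names (ess, ssr, fvalue) are statsmodels conventions:

```python
# F = (RSS / k) / (SSE / (n - k - 1)); statsmodels also reports it directly.
n, k = 100, 2
rss = model.ess                      # regression (explained) sum of squares
sse = model.ssr                      # sum of squared errors (residuals)
f_stat = (rss / k) / (sse / (n - k - 1))
print(f_stat, model.fvalue)          # manual and built-in values agree
print(model.f_pvalue)                # p-value for H0: b1 = b2 = 0
```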

2.4 Adjusted R2

R² = explained variation / total variation

With one independent variable, the coefficient of determination (R²) measures the goodness of fit of an estimated regression to the data.

As we add independent variables, R² increases even if the additional variation they explain is not statistically significant.

In multiple linear regression, R² is therefore less useful. The adjusted R² corrects for this by penalizing additional independent variables:

Adjusted R² = 1 − [(n − 1)/(n − k − 1)](1 − R²)

Interpreting R2
• Three independent variables together explain 85% of the variation in Y; the adjusted R² is also roughly 85%

• Create a new model by adding 10 more X variables:
  R² = 87%
  Adjusted R² = 83%

• Is the new model better? No: R² rose mechanically, but the adjusted R² fell, so the added variables do not improve the model.
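A small sketch of the adjustment using the percentages from this slide; the sample size n = 60 is an assumption for illustration, since the slide does not state it:

```python
# Adjusted R2 = 1 - ((n - 1) / (n - k - 1)) * (1 - R2)
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Penalize R2 for the number of independent variables k."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(adjusted_r2(0.85, n=60, k=3))    # ~0.84: close to R2 for the 3-variable model
print(adjusted_r2(0.87, n=60, k=13))   # ~0.83: lower despite the higher R2
```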

3. Using Dummy Variables In Regressions

A dummy variable takes the value 1 if a condition is true and the value 0 otherwise.

If there are n states of the world, use n − 1 dummy variables.

Example question: are stock returns in January different from returns in other months?
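A sketch of the month-of-the-year setup on simulated returns (all names and data are illustrative, not the Table 4 results): with 12 months we use 11 dummies, and the intercept picks up the omitted month (January).

```python
# January-effect sketch: regress returns on 11 month dummies.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
months = pd.Series(np.tile(np.arange(1, 13), 20))         # 20 years of monthly data
returns = rng.normal(loc=0.01, scale=0.05, size=240)      # simulated small-stock returns

dummies = pd.get_dummies(months, prefix="m", drop_first=True, dtype=float)  # drops m_1 (Jan)
X = sm.add_constant(dummies)
fit = sm.OLS(returns, X).fit()
print(fit.params["const"])        # mean January return (the baseline month)
print(fit.fvalue, fit.f_pvalue)   # joint F-test: all dummy coefficients = 0
```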

Example 5: Month-of-the-Year Effects on Small-Stock Returns

F-test

H0: all the slope coefficients (on the dummy variables) are 0

Can we reject the null at the 5% significance level?

The p-value of 0.1213 shown for the F-test in Table 4 means that the smallest level of significance at which we can reject the null hypothesis is roughly 0.12, or 12 percent. We therefore cannot reject the null at the 5% level.

F-Table

F-table lookup: df1 = 11, df2 = 276
Summary: F-stat, R-squared and Adjusted R-squared

• F-stat = mean regression sum of squares / mean sum of squared errors

• The F-stat is used to test the null hypothesis that all slope coefficients are jointly equal to 0

• R-squared = explained variation / total variation

• When you have multiple independent variables, this is also called the multiple R-squared, the multiple coefficient of determination, or simply the coefficient of determination

• Adjusted R-squared = 1 − [(n − 1)/(n − k − 1)](1 − R²); unlike R², it penalizes variables that add little explanatory power

4. Violations of Regression Assumptions

1. Heteroskedasticity

2. Serial Correlation

3. Multicollinearity

4. Summarizing the Issues

For each violation we should understand: what it is, its impact on statistical inference, how to detect it, and how to correct it.

4.1 Heteroskedasticity
• Error term variance differs across observations
  Unconditional heteroskedasticity: not a problem
  Conditional heteroskedasticity: a problem

• Consequences of heteroskedasticity
  The F-test for the overall significance of the regression is unreliable
  Coefficient estimates are fine, but standard errors are understated, so t-stats are inflated

• Testing for heteroskedasticity
  Breusch-Pagan test; H0: no conditional heteroskedasticity

• Correcting for heteroskedasticity
  Robust standard errors: correct the standard errors of the linear regression model's estimated coefficients to account for conditional heteroskedasticity
  Generalized least squares: modifies the original equation in an attempt to eliminate the heteroskedasticity

Read Examples 7 and 8
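A minimal sketch of the Breusch-Pagan test and a robust-errors correction, assuming `model` is a fitted statsmodels OLS result such as the one sketched in Section 2:

```python
# Breusch-Pagan: regress squared residuals on the regressors (done internally).
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)   # small p-value -> reject H0 of no conditional heteroskedasticity

# If present, re-estimate the standard errors robustly (coefficients unchanged):
robust = model.get_robustcov_results(cov_type="HC1")
print(robust.bse)  # heteroskedasticity-consistent standard errors
```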
4.2 Serial Correlation
• Serial correlation (autocorrelation): errors are correlated across observations
  Assumption here: the independent variables do not include a lagged value of the dependent variable

• Consequences of serial correlation
  Coefficient estimates are fine, but standard errors are understated
  t-stats and the F-stat are too high, so we incorrectly reject the null hypothesis (Type I error)

• Testing for serial correlation
  Durbin-Watson test

• Correcting for serial correlation
  Hansen method: adjust the standard errors of the linear regression parameter estimates to account for the serial correlation (recommended)
  Alternatively, modify the regression equation itself to eliminate the serial correlation
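A minimal sketch of the Durbin-Watson check, again assuming `model` is a fitted statsmodels OLS result. Note that statsmodels exposes Newey-West (HAC) standard errors, a serial-correlation-consistent adjustment in the same spirit as the Hansen method described above:

```python
# Durbin-Watson: values near 2 suggest no first-order serial correlation;
# values well below 2 suggest positive serial correlation.
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(model.resid))

# Serial-correlation-consistent (HAC / Newey-West) standard errors:
hac = model.get_robustcov_results(cov_type="HAC", maxlags=4)
print(hac.bse)
```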

4.3 Multicollinearity
• Multicollinearity: two or more independent variables (or combinations of independent variables) are highly (but not perfectly) correlated with each other

• Consequences of multicollinearity
  Standard errors are inflated, so the t-stats of the coefficients are artificially small

• Detecting multicollinearity
  A matter of degree rather than of absence or presence
  Classic symptom (illustrated in the sketch below): a high R² and a significant F-stat alongside inflated standard errors and low t-stats on the individual coefficients

• Correction
  Omit one or more of the "X" variables
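A simulated illustration of the classic symptom (all data invented for the sketch):

```python
# Two nearly identical regressors: jointly significant, individually not.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # almost perfectly correlated with x1
y = 1 + x1 + x2 + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared, fit.f_pvalue)   # high R2, tiny F-test p-value
print(fit.tvalues[1:])              # individual t-stats small and unstable
```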

Read Example 9
4.4 Summarizing the Issues
Problem: Heteroskedasticity
Effect: the F-test is unreliable; standard errors for the coefficients are underestimated, so t-stats are inflated
Solution: robust standard errors, or generalized least squares

Problem: Serial correlation
Effect: t-stats and the F-stat are too high
Solution: Hansen method, or modify the regression equation

Problem: Multicollinearity
Effect: inflated SEs, so t-stats of coefficients are artificially small
Solution: omit one or more of the "X" variables

5. Model Specification and Errors in Specification
1. Principles of Model Specification

2. Misspecified Functional Form

3. Time-Series Misspecification (Independent Variables Correlated with Errors)

4. Other Types of Time-Series Misspecification

5.1 Principles of Model Specification
• The model should be grounded in cogent economic reasoning

• The functional form chosen for the variables in the regression should be

appropriate given the nature of the variables

• The model should be parsimonious

• The model should be examined for violations of regression assumptions

• The model should be tested on out-of-sample data

5.2 Misspecified Functional Form
Whenever we estimate a regression, we must assume that the regression has the correct
functional form. This assumption can fail in several ways:

1. One or more important variables could be omitted from the regression
   Example 10: omitted variable bias; the estimated regression coefficients are biased and inconsistent

2. One or more of the regression variables may need to be transformed (for example, by taking the natural logarithm of the variable) before estimating the regression
   Example 11: nonlinearity and the bid-ask spread; see also Example 12

3. The regression model pools data from different samples that should not be pooled

5.3 Time Series Misspecification
Regression assumption 3 states that the error term has an expected value of 0, conditioned on the independent variables. When working with time-series data, this assumption is frequently violated, which causes the estimated regression coefficients to be biased and inconsistent. (Note that this is a problem with the coefficient estimates themselves, not merely with the t-stats, unlike the violations discussed in Section 4.)

Three common problems that create this type of time-series misspecification are:

1. including lagged dependent variables as independent variables in regressions with serially correlated errors

2. including a function of a dependent variable as an independent variable, sometimes as a result of the incorrect dating of variables

3. independent variables that are measured with error (see the next slide)

Example 14: The Fisher Effect with Measurement Error

5.4 Other Types of Time-Series Misspecification
The most frequent source of misspecification in linear regressions that use time series from two or
more different variables is nonstationarity.

Nonstationarity: mean and variance are not constant through time.

Situations in which we need to use stationarity tests before relying on the regression's statistical inference:

• Relations among time series with trends (for example, the relation between consumption and GDP)

• Relations among time series that may be random walks (time series for which the best predictor of
next period’s value is this period’s value). Exchange rates are often random walks.
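A sketch of one common stationarity check, the augmented Dickey-Fuller unit-root test from statsmodels, applied to a simulated random walk:

```python
# A random walk is nonstationary: the ADF test should fail to reject
# the null hypothesis of a unit root.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
random_walk = np.cumsum(rng.normal(size=500))   # best predictor of next value is last value
adf_stat, pvalue, *_ = adfuller(random_walk)
print(pvalue)   # large p-value -> cannot reject nonstationarity
```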

6. Models with Qualitative Dependent Variables
Qualitative dependent variables are dummy variables used as dependent variables rather than as independent variables. For example, to predict whether a company will go bankrupt, we use a qualitative dependent variable (bankrupt or not) and use data on the company's financial performance (e.g., return on equity, debt-to-equity ratio, or debt rating) as the independent variables.

Linear regression is not appropriate in these situations. We should instead use probit, logit, or discriminant analysis. Probit and logit models estimate the probability of a discrete outcome given the values of the independent variables used to explain that outcome. The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a condition is fulfilled) given the value of the independent variable X. The logit model is identical except that it is based on the logistic distribution rather than the normal distribution. Discriminant analysis yields a linear function that can be used to create an overall score; based on the score, an observation is classified into the bankrupt or not-bankrupt category.
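A minimal logit sketch for the bankruptcy example; the predictors, coefficients, and data are simulated purely for illustration:

```python
# Logit: estimate P(bankrupt = 1) from financial-performance variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
roe = rng.normal(0.10, 0.08, size=300)
debt_to_equity = rng.normal(1.0, 0.5, size=300)
z = -2 - 8 * roe + 1.5 * debt_to_equity
prob = 1 / (1 + np.exp(-z))                      # logistic link
bankrupt = rng.binomial(1, prob)                 # simulated 0/1 outcomes

X = sm.add_constant(np.column_stack([roe, debt_to_equity]))
fit = sm.Logit(bankrupt, X).fit()
print(fit.predict(X[:5]))    # estimated bankruptcy probabilities
# sm.Probit(bankrupt, X) is the normal-distribution analogue.
```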

Summary
• ANOVA

• Assumptions

• F-stat, R2 and adjusted R2

• Dummy variables

• Heteroskedasticity

• Serial correlation

• Multicollinearity

• Model misspecifications

• Qualitative dependent variables

Conclusion
• Read summary

• Review learning objectives

• Examples

• Practice problems

• Practice from other sources

