
CFA® Level II – Quantitative Methods

Multiple Regression and Issues in Regression Analysis

www.irfanullah.co
Graphs, charts, tables, examples, and figures are copyright 2012, CFA Institute. Reproduced
and republished with permission from CFA Institute. All rights reserved.

Contents and Introduction
1. Introduction

2. Multiple Linear Regression

3. Using Dummy Variables in Regressions

4. Violations of Regression Assumptions

5. Model Specification and Errors in Specification

6. Models with Qualitative Dependent Variables

2. Multiple Linear Regression

Multiple linear regression allows us to determine the effect of more than one
independent variable on a particular dependent variable

Yi = b0 + b1X1i + b2X2i + εi
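As a rough illustration (not from the reading), here is a minimal Python sketch of estimating this two-variable regression on simulated data; the statsmodels library and all variable names are assumptions for the example:

```python
# Minimal sketch: estimate Yi = b0 + b1*X1i + b2*X2i + eps with OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * X1 - 0.5 * X2 + eps              # true model: b0=1, b1=2, b2=-0.5

X = sm.add_constant(np.column_stack([X1, X2]))   # prepend the intercept column
model = sm.OLS(y, X).fit()
print(model.params)      # estimates of b0, b1, b2
print(model.summary())   # full ANOVA-style regression output
```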

Example 1: Explaining the Bid-Ask Spread

A log-log regression model may be appropriate when one believes that proportional changes in the dependent variable bear a constant relationship to proportional changes in the independent variable(s).
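For instance, here is a hedged sketch of a log-log specification on simulated (not actual bid-ask) data; in this form the slope estimates an elasticity, the proportional change in Y per proportional change in X:

```python
# Sketch of a log-log regression: ln(spread) on ln(volume), simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
volume = rng.lognormal(mean=10, sigma=1, size=200)
spread = 50 * volume ** -0.4 * rng.lognormal(sigma=0.2, size=200)  # true elasticity -0.4

X = sm.add_constant(np.log(volume))
fit = sm.OLS(np.log(spread), X).fit()
print(fit.params[1])   # slope = elasticity estimate, close to -0.4
```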

Example 1: Evaluating the Regression Output

Read Examples 2 and 3


2.1 Assumptions of the Multiple Linear Regression Model
To draw valid inferences from a multiple linear regression model, we need to make the following six assumptions:

1. The relationship between the dependent variable Y and the independent variables X1, X2, …, Xk is linear.
2. The independent variables (X1, X2, …, Xk) are not random. Also, no exact linear relation exists between two or more of the independent variables.
3. The expected value of the error term, conditioned on the independent variables, is 0: E(ε | X1, X2, …, Xk) = 0.
4. The variance of the error term is the same for all observations: E(εi²) = σε².
5. The error term is uncorrelated across observations: E(εi εj) = 0 for all j ≠ i.
6. The error term is normally distributed.

2.2 Predicting the Dependent Variable in a Multiple Regression Model
To predict the value of a dependent variable using a multiple linear regression model, we follow these three steps:

1. Obtain estimates of the regression parameters b0, b1, …, bk.
2. Determine the assumed values of the independent variables X1, X2, …, Xk.
3. Compute the predicted value of Y by substituting the assumed values into the estimated regression equation.

Read Example 4
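Continuing the simulated sketch from Section 2, the prediction step in code (the assumed values 0.5 and -1.0 of the independent variables are purely illustrative):

```python
# Predict Y for assumed values of the independent variables,
# reusing the fitted `model` from the earlier sketch.
import numpy as np

x_new = np.array([1.0, 0.5, -1.0])   # [intercept, X1, X2]
y_hat = model.params @ x_new         # step 3: b0 + b1*X1 + b2*X2
print(y_hat)                         # model.predict(x_new) gives the same
```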
2.3 Testing Whether All Population Regression Coefficients Equal Zero
To test the null hypothesis that all of the slope coefficients in the multiple regression model are jointly equal to 0 (H0: b1 = b2 = … = bk = 0) against the alternative hypothesis that at least one slope coefficient is not equal to 0, we use an F-test.
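Reusing `model` from the simulated sketch in Section 2 (n = 100 observations, k = 2 independent variables), here is where the F-stat comes from; the attribute names (ess, ssr, fvalue) are statsmodels conventions:

```python
# F = (RSS / k) / (SSE / (n - k - 1)); statsmodels also reports it directly.
n, k = 100, 2
rss = model.ess                      # regression (explained) sum of squares
sse = model.ssr                      # sum of squared errors (residuals)
f_stat = (rss / k) / (sse / (n - k - 1))
print(f_stat, model.fvalue)          # manual and built-in values agree
print(model.f_pvalue)                # p-value for H0: b1 = b2 = 0
```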

2.4 Adjusted R2

R² = explained variation / total variation

With one independent variable, the coefficient of determination (R²) measures the goodness of fit of an estimated regression to the data.

As we add independent variables, R² increases even if the additional variation they explain is not statistically significant.

In multiple linear regression, R² is therefore less useful. The adjusted R² corrects for this by penalizing additional independent variables:

Adjusted R² = 1 − [(n − 1)/(n − k − 1)](1 − R²)

Interpreting R2
• Three independent variables together explain 85% of the variation in Y; the adjusted R² is also roughly 85%

• Create a new model by adding 10 more X variables:
  R² = 87%
  Adjusted R² = 83%

• Is the new model better? No: R² rose mechanically, but the adjusted R² fell, so the added variables do not improve the model.
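A small sketch of the adjustment using the percentages from this slide; the sample size n = 60 is an assumption for illustration, since the slide does not state it:

```python
# Adjusted R2 = 1 - ((n - 1) / (n - k - 1)) * (1 - R2)
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Penalize R2 for the number of independent variables k."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(adjusted_r2(0.85, n=60, k=3))    # ~0.84: close to R2 for the 3-variable model
print(adjusted_r2(0.87, n=60, k=13))   # ~0.83: lower despite the higher R2
```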

3. Using Dummy Variables In Regressions

A dummy variable takes the value 1 if a condition is true and the value 0 otherwise.

If there are n states of the world, use n − 1 dummy variables.

Example question: are stock returns in January different from returns in other months?
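A sketch of the month-of-the-year setup on simulated returns (all names and data are illustrative, not the Table 4 results): with 12 months we use 11 dummies, and the intercept picks up the omitted month (January).

```python
# January-effect sketch: regress returns on 11 month dummies.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
months = pd.Series(np.tile(np.arange(1, 13), 20))         # 20 years of monthly data
returns = rng.normal(loc=0.01, scale=0.05, size=240)      # simulated small-stock returns

dummies = pd.get_dummies(months, prefix="m", drop_first=True, dtype=float)  # drops m_1 (Jan)
X = sm.add_constant(dummies)
fit = sm.OLS(returns, X).fit()
print(fit.params["const"])        # mean January return (the baseline month)
print(fit.fvalue, fit.f_pvalue)   # joint F-test: all dummy coefficients = 0
```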

Example 5: Month-of-the-Year Effects on Small-Stock Returns

F-test

H0: all the slope coefficients (on the dummy variables) are 0

Can we reject the null at the 5% significance level?

The p-value of 0.1213 shown for the F-test in Table 4 means that the smallest level of significance at which we can reject the null hypothesis is roughly 0.12, or 12 percent. We therefore cannot reject the null at the 5% level.

F-Table

F-table lookup: df1 = 11, df2 = 276
Summary: F-stat, R-squared and Adjusted R-squared

• F-stat = mean regression sum of squares / mean sum of squared errors

• The F-stat is used to test the null hypothesis that all slope coefficients are jointly equal to 0

• R-squared = explained variation / total variation

• When you have multiple independent variables, this is also called the multiple R-squared, the multiple coefficient of determination, or simply the coefficient of determination

• Adjusted R-squared = 1 − [(n − 1)/(n − k − 1)](1 − R²); unlike R², it penalizes variables that add little explanatory power

4. Violations of Regression Assumptions

1. Heteroskedasticity

2. Serial Correlation

3. Multicollinearity

4. Summarizing the Issues

For each violation we should understand: what it is, its impact on statistical inference, how to detect it, and how to correct it.

4.1 Heteroskedasticity
• Error term variance differs across observations
  Unconditional heteroskedasticity: not a problem
  Conditional heteroskedasticity: a problem

• Consequences of heteroskedasticity
  The F-test for the overall significance of the regression is unreliable
  Coefficient estimates are fine, but standard errors are understated, so t-stats are inflated

• Testing for heteroskedasticity
  Breusch-Pagan test; H0: no conditional heteroskedasticity

• Correcting for heteroskedasticity
  Robust standard errors: correct the standard errors of the linear regression model's estimated coefficients to account for conditional heteroskedasticity
  Generalized least squares: modifies the original equation in an attempt to eliminate the heteroskedasticity

Read Examples 7 and 8
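A minimal sketch of the Breusch-Pagan test and a robust-errors correction, assuming `model` is a fitted statsmodels OLS result such as the one sketched in Section 2:

```python
# Breusch-Pagan: regress squared residuals on the regressors (done internally).
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)   # small p-value -> reject H0 of no conditional heteroskedasticity

# If present, re-estimate the standard errors robustly (coefficients unchanged):
robust = model.get_robustcov_results(cov_type="HC1")
print(robust.bse)  # heteroskedasticity-consistent standard errors
```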
4.2 Serial Correlation
• Serial correlation (autocorrelation): errors are correlated across observations
  Assumption here: the independent variables do not include a lagged value of the dependent variable

• Consequences of serial correlation
  Coefficient estimates are fine, but standard errors are understated
  t-stats and the F-stat are too high, so we incorrectly reject the null hypothesis (Type I error)

• Testing for serial correlation
  Durbin-Watson test

• Correcting for serial correlation
  Hansen method: adjust the standard errors of the linear regression parameter estimates to account for the serial correlation (recommended)
  Alternatively, modify the regression equation itself to eliminate the serial correlation
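A minimal sketch of the Durbin-Watson check, again assuming `model` is a fitted statsmodels OLS result. Note that statsmodels exposes Newey-West (HAC) standard errors, a serial-correlation-consistent adjustment in the same spirit as the Hansen method described above:

```python
# Durbin-Watson: values near 2 suggest no first-order serial correlation;
# values well below 2 suggest positive serial correlation.
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(model.resid))

# Serial-correlation-consistent (HAC / Newey-West) standard errors:
hac = model.get_robustcov_results(cov_type="HAC", maxlags=4)
print(hac.bse)
```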

4.3 Multicollinearity
• Multicollinearity: two or more independent variables (or combinations of independent variables) are highly (but not perfectly) correlated with each other

• Consequences of multicollinearity
  Standard errors are inflated, so the t-stats of the coefficients are artificially small

• Detecting multicollinearity
  A matter of degree rather than of absence or presence
  Classic symptom (illustrated in the sketch below): a high R² and a significant F-stat alongside inflated standard errors and low t-stats on the individual coefficients

• Correction
  Omit one or more of the "X" variables
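A simulated illustration of the classic symptom (all data invented for the sketch):

```python
# Two nearly identical regressors: jointly significant, individually not.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # almost perfectly correlated with x1
y = 1 + x1 + x2 + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.rsquared, fit.f_pvalue)   # high R2, tiny F-test p-value
print(fit.tvalues[1:])              # individual t-stats small and unstable
```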

Read Example 9
4.4 Summarizing the Issues
Problem: Heteroskedasticity
Effect: the F-test is unreliable; standard errors for the coefficients are underestimated, so t-stats are inflated
Solution: robust standard errors, or generalized least squares

Problem: Serial correlation
Effect: t-stats and the F-stat are too high
Solution: Hansen method, or modify the regression equation

Problem: Multicollinearity
Effect: inflated SEs, so t-stats of coefficients are artificially small
Solution: omit one or more of the "X" variables

5. Model Specification and Errors in Specification
1. Principles of Model Specification

2. Misspecified Functional Form

3. Time-Series Misspecification (Independent Variables Correlated with Errors)

4. Other Types of Time-Series Misspecification

5.1 Principles of Model Specification
• The model should be grounded in cogent economic reasoning

• The functional form chosen for the variables in the regression should be

appropriate given the nature of the variables

• The model should be parsimonious

• The model should be examined for violations of regression assumptions

• The model should be tested on out-of-sample data

5.2 Misspecified Functional Form
Whenever we estimate a regression, we must assume that the regression has the correct
functional form. This assumption can fail in several ways:

1. One or more important variables could be omitted from the regression
   Example 10: omitted variable bias; the estimated regression coefficients are biased and inconsistent

2. One or more of the regression variables may need to be transformed (for example, by taking the natural logarithm of the variable) before estimating the regression
   Example 11: nonlinearity and the bid-ask spread; see also Example 12

3. The regression model pools data from different samples that should not be pooled

5.3 Time Series Misspecification
Regression assumption 3 states that the error term has an expected value of 0, conditioned on the independent variables. When working with time-series data, this assumption is frequently violated, which causes the estimated regression coefficients to be biased and inconsistent. (Note that this is a problem with the coefficient estimates themselves, not merely with the t-stats, unlike the violations discussed in Section 4.)

Three common problems that create this type of time-series misspecification are:

1. including lagged dependent variables as independent variables in regressions with serially correlated errors

2. including a function of a dependent variable as an independent variable, sometimes as a result of the incorrect dating of variables

3. independent variables that are measured with error (see the next slide)

Example 14: The Fisher Effect with Measurement Error

5.4 Other Types of Time-Series Misspecification
The most frequent source of misspecification in linear regressions that use time series from two or
more different variables is nonstationarity.

Nonstationarity: mean and variance are not constant through time.

Situations in which we need to use stationarity tests before relying on the regression's statistical inference:

• Relations among time series with trends (for example, the relation between consumption and GDP)

• Relations among time series that may be random walks (time series for which the best predictor of
next period’s value is this period’s value). Exchange rates are often random walks.
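A sketch of one common stationarity check, the augmented Dickey-Fuller unit-root test from statsmodels, applied to a simulated random walk:

```python
# A random walk is nonstationary: the ADF test should fail to reject
# the null hypothesis of a unit root.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
random_walk = np.cumsum(rng.normal(size=500))   # best predictor of next value is last value
adf_stat, pvalue, *_ = adfuller(random_walk)
print(pvalue)   # large p-value -> cannot reject nonstationarity
```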

6. Models with Qualitative Dependent Variables
Qualitative dependent variables are dummy variables used as dependent variables rather than as independent variables. For example, to predict whether a company will go bankrupt, we use a qualitative dependent variable (bankrupt or not) and use data on the company's financial performance (e.g., return on equity, debt-to-equity ratio, or debt rating) as the independent variables.

Linear regression is not appropriate in these situations. We should instead use probit, logit, or discriminant analysis. Probit and logit models estimate the probability of a discrete outcome given the values of the independent variables used to explain that outcome. The probit model, which is based on the normal distribution, estimates the probability that Y = 1 (a condition is fulfilled) given the value of the independent variable X. The logit model is identical except that it is based on the logistic distribution rather than the normal distribution. Discriminant analysis yields a linear function that can be used to create an overall score; based on the score, an observation is classified into the bankrupt or not-bankrupt category.
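A minimal logit sketch for the bankruptcy example; the predictors, coefficients, and data are simulated purely for illustration:

```python
# Logit: estimate P(bankrupt = 1) from financial-performance variables.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
roe = rng.normal(0.10, 0.08, size=300)
debt_to_equity = rng.normal(1.0, 0.5, size=300)
z = -2 - 8 * roe + 1.5 * debt_to_equity
prob = 1 / (1 + np.exp(-z))                      # logistic link
bankrupt = rng.binomial(1, prob)                 # simulated 0/1 outcomes

X = sm.add_constant(np.column_stack([roe, debt_to_equity]))
fit = sm.Logit(bankrupt, X).fit()
print(fit.predict(X[:5]))    # estimated bankruptcy probabilities
# sm.Probit(bankrupt, X) is the normal-distribution analogue.
```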

Summary
• ANOVA

• Assumptions

• F-stat, R2 and adjusted R2

• Dummy variables

• Heteroskedasticity

• Serial correlation

• Multicollinearity

• Model misspecifications

• Qualitative dependent variables

Conclusion
• Read summary

• Review learning objectives

• Examples

• Practice problems

• Practice from other sources

