
6. Regression analysis

Otavio R de Medeiros

Introduction
A regression is a model of the relationship between one variable on one side and one or more variables on the other side.
Regression analysis constructs and tests a mathematical model of the relationship between one dependent (endogenous) variable and one or more independent (exogenous) variables.
The direction of causality between the variables is determined from a priori information (e.g. theory) and embodied in the model by way of hypothesis.
The regression analysis tests the statistical strength of the model as hypothesized.
E.g. the hypothesis that the level of the FTSE 100 is linearly dependent on the S&P 500 can be tested using simple linear regression (Fig. 6.1).
Regressions can be simple (2 variables) or multiple (more than 2 variables).
There are 3 types of regression:
- Time series
- Cross section
- Panel data

[Figure: scatter of Y (dependent variable) against X (independent variable), with the regression line $Y = \alpha + \beta X$ and an outlier marked]

Simple linear regression:

$$Y = \alpha + \beta X + e$$

where Y = dependent variable; X = independent variable; $\alpha$ = constant or intercept; $\beta$ = slope or regression coefficient; e = random error or disturbance term.
The error term exists because there are other unobserved and unknown effects not included in the regression.
We cannot infer causality from the regression itself.
Regression analysis cannot prove a hypothesis; it can only support or fail to support the hypothesis formulated.

Ordinary least-squares (OLS) regression
To test the relationship between Y and X it is necessary to estimate $\alpha$ and $\beta$ (and hence e) using a method that gives the Best Linear Unbiased Estimator (BLUE):
- Best: most efficient, i.e. smallest variance
- Linear
- Unbiased: $E(\hat{\alpha}) = \alpha$; $E(\hat{\beta}) = \beta$

Ordinary least squares (OLS) minimizes the sum of squares of e, i.e. it minimizes $\sum e^2$.
If the data comply with the assumptions which will be seen later, OLS gives the BLUE, i.e. it gives the straight line that best fits the data by calculating the line that minimizes the sum of squares of the errors between $Y$ and $\hat{Y}$ (Fig. 6.2).

Statistical assumptions of OLS regression:
1. The mathematical formula of the relationship between the true dependent variable Y and the independent variable X is:

$$Y = \alpha + \beta X + e$$

Estimated model:

$$\hat{Y} = \hat{\alpha} + \hat{\beta} X$$

2. The error term e is normally distributed with zero mean and constant variance $\sigma^2$, i.e. $e \sim N(0, \sigma^2)$
3. The successive error terms are independent of each other, i.e. $\mathrm{cov}(e_i, e_j) = 0$ for $i \neq j$
4. X is non-stochastic (or exogenous, please explain...)

Normality is also known as Gaussianity, i.e. having a Gaussian distribution.
If e has constant variance $\sigma^2$, this is called homoscedasticity.
If the variance of e is not constant, this is called heteroscedasticity, i.e. the opposite of homoscedasticity.
If $\mathrm{cov}(e_i, e_j) = 0$, i.e. the e values are independent of each other, the residuals are called non-autocorrelated or non-serially correlated.
This assumption means that the factors that caused one value of Y to show an error do not automatically cause all the observations of Y to show an error.

As Y is related to e in a linear form, Y itself is a random variable.
For any value of X, Y will be $\sim N(\mu, \sigma^2)$ and therefore the statistical distribution of Y can be fully described by its mean and variance.
The expected value (mean) of Y:

$$Y_i = \alpha + \beta X_i + e_i$$
$$E(Y_i) = E(\alpha + \beta X_i + e_i) = E(\alpha) + E(\beta X_i) + E(e_i) = \alpha + \beta X_i + E(e_i)$$

But since $E(e_i) = 0$,

$$E(Y_i) = \alpha + \beta X_i$$

As the expected value of e is 0, the variance of Y, which is also the variance of e, is the mean value of $e^2$, i.e. $\sum (e_i - \bar{e})^2 / n = \sum (e_i - 0)^2 / n = E(e_i^2) = \sigma^2$.
Thus $Y \sim N(\alpha + \beta X, \sigma^2)$.
This can be seen in Fig. 6.3.

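The variance of Y:
Given the regression model

$$Y_i = \alpha + \beta X_i + e_i$$

if we take variances on both sides, and note that $\alpha$ and $\beta X_i$ are not random (X is non-stochastic) so their variances are zero, we get

$$\mathrm{Var}(Y_i) = \mathrm{Var}(\alpha + \beta X_i + e_i) = \mathrm{Var}(\alpha) + \mathrm{Var}(\beta X_i) + \mathrm{Var}(e_i) = \mathrm{Var}(e_i)$$
$$\mathrm{Var}(e_i) = \frac{\sum (e_i - \bar{e})^2}{n} = \frac{\sum (e_i - 0)^2}{n} = \frac{\sum e_i^2}{n} = E(e_i^2) = \sigma^2$$

Thus $Y_i \sim N(\alpha + \beta X_i, \sigma^2)$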

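Fitting the regression line
The values of $\alpha$ and $\beta$ that minimize $\sum e^2$ are

$$\hat{\beta} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = \frac{\sum [(X - \bar{X})(Y - \bar{Y})]}{\sum (X - \bar{X})^2}$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$

The error terms (residuals) are

$$e_i = Y_i - \hat{Y}_i$$

where $Y_i$ is the observed value of Y and $\hat{Y}_i$ is the estimated value of Y.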
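The closed-form estimators for the slope, intercept and residuals can be sketched in a few lines of Python. This is only an illustration: the function name `ols_simple` and the four data points are hypothetical and are not the book's Table 6.1 data.

```python
import numpy as np

def ols_simple(x, y):
    """Return (alpha_hat, beta_hat, residuals) for the model Y = alpha + beta*X + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    beta_hat = np.sum(dx * dy) / np.sum(dx ** 2)   # cov(X, Y) / var(X)
    alpha_hat = y.mean() - beta_hat * x.mean()     # Ybar - beta_hat * Xbar
    residuals = y - (alpha_hat + beta_hat * x)     # e_i = Y_i - Yhat_i
    return alpha_hat, beta_hat, residuals

# Hypothetical data lying exactly on Y = 1 + 2X, so the residuals are zero.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [3.0, 5.0, 7.0, 9.0]
alpha_hat, beta_hat, residuals = ols_simple(x_demo, y_demo)
print(alpha_hat, beta_hat)
```

The factor 1/n cancels between the covariance and the variance, which is why the code works with plain sums of deviations.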

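Demonstration:

$$SS = \sum (Y - \hat{Y})^2 = \sum (Y - \hat{\alpha} - \hat{\beta} X)^2$$
$$\frac{\partial (SS)}{\partial \hat{\alpha}} = \sum \left(-2(Y - \hat{\alpha} - \hat{\beta} X)\right)$$
$$\frac{\partial (SS)}{\partial \hat{\beta}} = \sum \left(-2X(Y - \hat{\alpha} - \hat{\beta} X)\right)$$

SS is minimized when the partial derivatives are set to zero, i.e.

$$\sum \left(-2(Y - \hat{\alpha} - \hat{\beta} X)\right) = 0$$
$$\sum \left(-2X(Y - \hat{\alpha} - \hat{\beta} X)\right) = 0$$

This is achieved when

$$\sum Y = n\hat{\alpha} + \hat{\beta} \sum X$$
$$\sum XY = \hat{\alpha} \sum X + \hat{\beta} \sum X^2$$

This is a simultaneous equation problem.
Multiply the 1st equation by $\sum X$ and the second by n:

$$\sum X \sum Y = n\hat{\alpha} \sum X + \hat{\beta} \left(\sum X\right)^2$$
$$n \sum XY = n\hat{\alpha} \sum X + n\hat{\beta} \sum X^2$$

Subtracting the 1st equation from the 2nd gives

$$n \sum XY - \sum X \sum Y = n\hat{\beta} \sum X^2 - \hat{\beta} \left(\sum X\right)^2 = \hat{\beta} \left(n \sum X^2 - \left(\sum X\right)^2\right)$$
$$\therefore \hat{\beta} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left(\sum X\right)^2}$$

Since $\sum X = n\bar{X}$ and $\sum Y = n\bar{Y}$,

$$\hat{\beta} = \frac{n \sum XY - n\bar{X}\, n\bar{Y}}{n \sum X^2 - (n\bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{\sum [(X - \bar{X})(Y - \bar{Y})]}{\sum (X - \bar{X})^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$

Since

$$\sum Y = n\hat{\alpha} + \hat{\beta} \sum X$$

dividing by n gives

$$\frac{1}{n} \sum Y = \hat{\alpha} + \hat{\beta}\, \frac{1}{n} \sum X \quad \therefore \quad \bar{Y} = \hat{\alpha} + \hat{\beta} \bar{X}$$

Solving for $\hat{\alpha}$:

$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$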

Example: table 6.1 (page 189) and page 194

$$\hat{\beta} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = \frac{\sum [(X - \bar{X})(Y - \bar{Y})]}{\sum (X - \bar{X})^2} = \frac{644387.5}{108046.7} = 5.964$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X} = 2530.74 - 5.964 \times 391.4187 = 196.3298$$
$$\hat{Y} = 196.3298 + 5.964 X$$

Interpretation of the regression equation:

$$\hat{Y} = 196.3298 + 5.964 X$$

[Figure: the fitted line plotted with Y on the vertical axis, showing the intercept 196.33 and the slope $\tan\theta = 5.96$]

Significance tests of coefficients
As shown in Fig. 6.3, calculating the regression coefficients gives single (point) estimates of Y.
The estimated regression coefficients are also assumed to come from a normal distribution.
We need to know the statistical significance of these coefficients, i.e. to test whether the regression coefficients are significantly different from zero.
The statistical significance of a coefficient is measured by the degree of dispersion around its estimated value.
As the errors or residuals are assumed to be $\sim N(0, \sigma^2)$, the standard deviation of the errors is used to measure that dispersion.
These standard deviations are called the standard errors of the coefficients.

Significance tests of coefficients
We use t-statistics to indicate the degree of significance of the coefficients.
To derive these measures we need to know:
- The sampling distribution of those coefficients
- Estimates of their variances and their standard deviations

We can then perform tests of hypotheses concerning the coefficients or construct confidence intervals for them.

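Sampling distribution
The sampling distribution of $\hat{\alpha}$ is

$$\hat{\alpha} \sim N\left(\alpha,\; \frac{\sigma^2 \sum X^2}{n \sum (X - \bar{X})^2}\right)$$

The sampling distribution of $\hat{\beta}$ is

$$\hat{\beta} \sim N\left(\beta,\; \frac{\sigma^2}{\sum (X - \bar{X})^2}\right)$$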

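The estimated variances and standard deviations

$$s^2 = \frac{\sum e_i^2}{n - 2}$$

The standard errors
SE of $\hat{\alpha}$:

$$SE(\hat{\alpha}) = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} \cdot \frac{\sum X_i^2}{n \sum (X_i - \bar{X})^2}}$$

SE of $\hat{\beta}$:

$$SE(\hat{\beta}) = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} \cdot \frac{1}{\sum (X_i - \bar{X})^2}}$$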

For data with a normal distribution, the difference between a variable and its mean divided by the estimate of its standard deviation has a t-distribution.
The probability statements are:

$$P\left(-t_{n-2,\,c/2} \le \frac{\hat{\alpha} - \alpha}{SE(\hat{\alpha})} \le t_{n-2,\,c/2}\right) = 1 - c$$
$$P\left(-t_{n-2,\,c/2} \le \frac{\hat{\beta} - \beta}{SE(\hat{\beta})} \le t_{n-2,\,c/2}\right) = 1 - c$$

From these equations we derive the confidence intervals

$$P\left(\hat{\alpha} - t_{n-2,\,c/2}\, SE(\hat{\alpha}) \le \alpha \le \hat{\alpha} + t_{n-2,\,c/2}\, SE(\hat{\alpha})\right) = 1 - c$$
$$P\left(\hat{\beta} - t_{n-2,\,c/2}\, SE(\hat{\beta}) \le \beta \le \hat{\beta} + t_{n-2,\,c/2}\, SE(\hat{\beta})\right) = 1 - c$$

where c = probability of the variable being in the tails of the t-distribution.

Thus there is a probability of 1 - c that the true value of each coefficient falls within the range specified. If that range includes zero, the coefficient is not statistically significantly different from zero.

Estimating the Variance of the Disturbance Term
The variance of the random variable $u_t$ is given by

$$\mathrm{Var}(u_t) = E[(u_t - E(u_t))^2]$$

which, since $E(u_t) = 0$, reduces to

$$\mathrm{Var}(u_t) = E(u_t^2)$$

We could estimate this using the average of $u_t^2$:

$$s^2 = \frac{1}{T} \sum u_t^2$$

Unfortunately this is not workable since $u_t$ is not observable. We can use the sample counterpart to $u_t$, which is $\hat{u}_t$:

$$s^2 = \frac{1}{T} \sum \hat{u}_t^2$$

But this estimator is a biased estimator of $\sigma^2$.
An unbiased estimator of $\sigma^2$ is given by

$$s^2 = \frac{\sum \hat{u}_t^2}{T - 2}$$

where $\sum \hat{u}_t^2$ is the residual sum of squares and T is the sample size.

Derivation of the OLS standard error estimator:

$$\begin{aligned}
\hat{\beta} &= (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u \\
&= \beta + (X'X)^{-1}X'u \\
\mathrm{Var}(\hat{\beta}) &= E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = E[(\beta + (X'X)^{-1}X'u - \beta)(\beta + (X'X)^{-1}X'u - \beta)'] \\
&= E[((X'X)^{-1}X'u)((X'X)^{-1}X'u)'] = E[(X'X)^{-1}X'uu'X(X'X)^{-1}] \\
&= (X'X)^{-1}X'E(uu')X(X'X)^{-1}
\end{aligned}$$

But $E(uu') = \sigma^2 I$, therefore

$$\mathrm{Var}(\hat{\beta}) = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$$

which is estimated by $s^2 (X'X)^{-1}$.

Example
The following model with k = 3 is estimated over 15 observations:

$$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

and the following data have been calculated from the original Xs:

$$(X'X)^{-1} = \begin{pmatrix} 2.0 & 3.5 & -1.0 \\ 3.5 & 1.0 & 6.5 \\ -1.0 & 6.5 & 4.3 \end{pmatrix}, \quad (X'y) = \begin{pmatrix} -3.0 \\ 2.2 \\ 0.6 \end{pmatrix}, \quad \hat{u}'\hat{u} = 10.96$$

Calculate the coefficient estimates and their standard errors.
To calculate the coefficients, just multiply the matrix by the vector to obtain $(X'X)^{-1}X'y$.
To calculate the standard errors, we need an estimate of $\sigma^2$:

$$s^2 = \frac{RSS}{T - k} = \frac{10.96}{15 - 3} = 0.91$$

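The variance-covariance matrix of $\hat{\beta}$ is given by

$$s^2 (X'X)^{-1} = 0.91\,(X'X)^{-1} = \begin{pmatrix} 1.83 & 3.20 & -0.91 \\ 3.20 & 0.91 & 5.94 \\ -0.91 & 5.94 & 3.93 \end{pmatrix}$$

The variances are on the leading diagonal:

$$\mathrm{Var}(\hat{\beta}_1) = 1.83 \Rightarrow SE(\hat{\beta}_1) = 1.35$$
$$\mathrm{Var}(\hat{\beta}_2) = 0.91 \Rightarrow SE(\hat{\beta}_2) = 0.96$$
$$\mathrm{Var}(\hat{\beta}_3) = 3.93 \Rightarrow SE(\hat{\beta}_3) = 1.98$$

We write (standard errors in parentheses):

$$\hat{y} = \underset{(1.35)}{1.10} - \underset{(0.96)}{4.40}\, x_{2t} + \underset{(1.98)}{19.88}\, x_{3t}$$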
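The matrix arithmetic of this example can be checked with a short NumPy sketch. The signs of the matrix and vector entries are an assumption here: they are the ones that reproduce the reported estimates 1.10, -4.40 and 19.88. The variable names are illustrative only.

```python
import numpy as np

# Quantities from the example; signs are assumed so that the
# reported coefficient estimates are reproduced.
xtx_inv = np.array([[ 2.0, 3.5, -1.0],
                    [ 3.5, 1.0,  6.5],
                    [-1.0, 6.5,  4.3]])
xty = np.array([-3.0, 2.2, 0.6])
rss = 10.96          # residual sum of squares u'u
T, k = 15, 3         # observations and estimated parameters

beta_hat = xtx_inv @ xty                 # (X'X)^{-1} X'y
s2 = rss / (T - k)                       # estimate of sigma^2
var_cov = s2 * xtx_inv                   # s^2 (X'X)^{-1}
std_err = np.sqrt(np.diag(var_cov))      # square roots of the leading diagonal

print(np.round(beta_hat, 2))
print(np.round(std_err, 2))
```

Because the code uses the unrounded $s^2 = 0.9133$ rather than 0.91, individual entries of the variance-covariance matrix can differ from the slide in the second decimal place.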

A Special Type of Hypothesis Test: The t-ratio
Recall that the formula for a test of significance approach to hypothesis testing using a t-test was

$$\text{test statistic} = \frac{\hat{\beta}_i - \beta_i^*}{SE(\hat{\beta}_i)}$$

If the test is
H0: $\beta_i = 0$
H1: $\beta_i \neq 0$
i.e. a test that the population coefficient is zero against a two-sided alternative, this is known as a t-ratio test. Since $\beta_i^* = 0$,

$$\text{test stat} = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)}$$

The ratio of the coefficient to its SE is known as the t-ratio or t-statistic.

The t-ratio: An Example
In the last example above:

              beta_1   beta_2   beta_3
Coefficient     1.10    -4.40    19.88
SE              1.35     0.96     1.98
t-ratio         0.81    -4.63    10.04

Compare this with a t_crit with 15 - 3 = 12 d.f. (2.5% in each tail for a 5% test):
5%: 2.179
1%: 3.055

Do we reject
H0: $\beta_1 = 0$? (No)
H0: $\beta_2 = 0$? (Yes)
H0: $\beta_3 = 0$? (Yes)
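As a rough check of the decisions above, the t-ratios can be recomputed from the rounded coefficients and standard errors and compared with the two critical values quoted for 12 degrees of freedom. The names below are illustrative; because the inputs are rounded, the recomputed ratios may differ from the tabulated ones in the second decimal place.

```python
coefficients = [1.10, -4.40, 19.88]
std_errors = [1.35, 0.96, 1.98]

# Two-sided critical values for 12 degrees of freedom, as quoted in the example.
T_CRIT_5PCT = 2.179
T_CRIT_1PCT = 3.055

def t_ratio(coef, se):
    """t-ratio for H0: beta_i = 0, i.e. the coefficient divided by its standard error."""
    return coef / se

t_ratios = [t_ratio(b, se) for b, se in zip(coefficients, std_errors)]
reject_5pct = [abs(t) > T_CRIT_5PCT for t in t_ratios]
reject_1pct = [abs(t) > T_CRIT_1PCT for t in t_ratios]

for t, r5, r1 in zip(t_ratios, reject_5pct, reject_1pct):
    print(f"t = {t:6.2f}  reject at 5%: {r5}  reject at 1%: {r1}")
```

The comparison uses the absolute value of the t-ratio because the alternative hypothesis is two-sided.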

Hypothesis testing:
The regression equation is frequently created to test a hypothesis. This is achieved by setting up the null hypothesis H0 that the coefficients are not statistically significantly different from zero, e.g.

H0: $\alpha = 0$    H1: $\alpha \neq 0$
H0: $\beta = 0$    H1: $\beta \neq 0$

To test these hypotheses, we need to calculate the t-statistics for the coefficients:

$$t_{\alpha} = \frac{\hat{\alpha}}{SE(\hat{\alpha})}, \qquad t_{\beta} = \frac{\hat{\beta}}{SE(\hat{\beta})}$$

It is usual to test for statistical significance at the 95% or 99% level of confidence, i.e. at the 5% or 1% significance level: if the true coefficient were zero, a t-statistic this large would arise by chance in only 5% or 1% of samples.
The probability distribution of the t-statistics is a t-distribution with n-2 degrees of freedom.
Degrees of freedom here are the number of pairs of data points used in the regression minus the 2 estimated coefficients.
A regression coefficient is significant if the absolute value of its t-statistic is greater than the critical value given in the t-distribution tables.
From the book example (page 198), $\hat{\beta}$ = 5.964 and $SE(\hat{\beta})$ = 0.3476, hence t = 5.964/0.3476 = 17.1577
The test statistic for $\hat{\alpha}$ is t = 196.3298/136.991 = 1.4332
95% confidence intervals (using a critical value of approximately 2):
for $\alpha$: $196.3298 - 2 \times 136.991 \le \alpha \le 196.3298 + 2 \times 136.991 \;\rightarrow\; -77.65 \le \alpha \le 470.31$
for $\beta$: $5.964 - 2 \times 0.3476 \le \beta \le 5.964 + 2 \times 0.3476 \;\rightarrow\; 5.27 \le \beta \le 6.66$
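The arithmetic of the t-statistics and the approximate 95% confidence intervals can be reproduced with the sketch below. The helper names are hypothetical, and the critical value of 2 is the rounded value used in the example rather than an exact table value.

```python
def t_stat(coef, se):
    """t-statistic for H0: coefficient = 0."""
    return coef / se

def conf_interval(coef, se, t_crit=2.0):
    """Interval coef +/- t_crit * se; t_crit = 2 approximates the 5% two-sided value."""
    return coef - t_crit * se, coef + t_crit * se

alpha_hat, se_alpha = 196.3298, 136.991
beta_hat, se_beta = 5.964, 0.3476

t_alpha = t_stat(alpha_hat, se_alpha)
t_beta = t_stat(beta_hat, se_beta)
ci_alpha = conf_interval(alpha_hat, se_alpha)
ci_beta = conf_interval(beta_hat, se_beta)

print(round(t_alpha, 4), round(t_beta, 4))
print(ci_alpha, ci_beta)
```

The interval for the intercept contains zero while the interval for the slope does not, which matches the t-statistics: only the slope is significantly different from zero.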

A one-tailed test or a two-tailed test?
We have to decide whether the significance test will be a 1-tailed or a 2-tailed test.
This decision is made before the regression results are known.
The choice is determined by the theory underlying the model between X and Y which the regression is testing.
E.g. if a theory says that the slope of the relationship between X and Y should be greater than one, our test should be
H0: $\beta = 1$
H1: $\beta > 1$
If we reject H0, it means that the data support the theory.

Goodness of fit: the coefficient of determination $R^2$
[Figure: Y plotted against X]
The total sum of squares (SST) is the sum of the squared differences between $Y$ and $\bar{Y}$, i.e.

$$SST = \sum (Y_i - \bar{Y})^2$$

The sum of squares due to the regression (SSR) is the sum of the squared differences between $\hat{Y}$ and $\bar{Y}$:

$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$

The sum of squares due to the error (SSE) is the sum of the squared differences between $Y$ and $\hat{Y}$:

$$SSE = \sum (Y_i - \hat{Y}_i)^2$$

SST = SSR + SSE
The ratio between SSR and SST gives the proportion of the variation in Y explained by the variation in X and is referred to as $R^2$ = coefficient of determination or goodness of fit:

$$R^2 = \frac{SSR}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$$

but SSR = SST - SSE, therefore

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{e'e}{(Y - \bar{Y} i)'(Y - \bar{Y} i)}$$

where i = column vector of 1s.
If the regression is so good that all the points lie exactly on the regression line, then $\hat{Y}_i = Y_i$. In this case we would have $R^2 = 1$, i.e. a perfect regression.
If the regression is very bad, the regression line will be the mean, i.e. $\hat{Y}_i = \bar{Y}$, therefore $\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \bar{Y})^2$ and $R^2 = 0$.
Hence, $R^2$ will have a value ranging from 0 to 1, i.e. $0 \le R^2 \le 1$
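The decomposition SST = SSR + SSE and the resulting $R^2$ can be illustrated numerically. The sketch below fits a line with `numpy.polyfit` to five hypothetical data points (not the book's data); the function name `sums_of_squares` is also an assumption.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE, R^2) for observed y and OLS fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)        # total variation of Y about its mean
    ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
    sse = np.sum((y - y_hat) ** 2)           # unexplained (residual) variation
    return sst, ssr, sse, 1.0 - sse / sst

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
slope, intercept = np.polyfit(x, y, 1)       # least-squares straight line
sst, ssr, sse, r2 = sums_of_squares(y, intercept + slope * x)
print(sst, ssr, sse, r2)
```

The identity SST = SSR + SSE holds here because the fitted values come from a least-squares fit that includes an intercept; for arbitrary predictions it need not hold.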

If $R^2$ is multiplied by 100 and expressed as a percentage, it represents the proportion of the variation in Y that is explained by the variation in X.
$R^2$ is a random variable, and the following ratio based on it has an F distribution.
The test statistic is

$$F_{k-1,\,n-2} = \frac{R^2 / (k - 1)}{(1 - R^2) / (n - 2)}$$

In simple regression the test has 1 degree of freedom in the numerator and n-2 DFs in the denominator.

Example FTSE x S&P500: $R^2$ = 0.8548; n = 52; DF = 1; 50
H0: $R^2 = 0$
H1: $R^2 > 0$
The test statistic is:

$$F_{1,50} = \frac{R^2}{(1 - R^2)/(n - 2)} = \frac{0.8548}{(1 - 0.8548)/50} = 294.4$$

From the F table, we see that the 5% critical value for $v_1$ = 1 and $v_2$ = 50 is 4.03.
As the value of the test statistic (294.4) is greater than 4.03, we reject the null that $R^2 = 0$.
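The F statistic of this example can be recomputed directly from $R^2$ and n. The function name is hypothetical and the formula is the simple-regression case, with 1 and n - 2 degrees of freedom; the critical value 4.03 is the one quoted from the F table.

```python
def f_stat_simple(r2, n):
    """F statistic for H0: R^2 = 0 in a simple regression (1 and n-2 d.f.)."""
    return r2 / ((1.0 - r2) / (n - 2))

F_CRIT_5PCT = 4.03                 # 5% critical value for (1, 50) d.f., as quoted

f_value = f_stat_simple(0.8548, 52)
reject_h0 = f_value > F_CRIT_5PCT
print(round(f_value, 1), reject_h0)
```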

Using regression for prediction
The prediction interval: the results of applying the OLS model can be used for prediction.
E.g. suppose that we wish to predict the level of the FTSE 100 if the S&P 500 rose to 550. The predicted value would be:

$$\hat{Y} = 196.3298 + 5.963972 \times 550 \approx 3477$$

We want to know the degree of confidence to place on that estimated value. For this purpose we calculate the standard error of the estimate and then the prediction interval.

Using regression for prediction (SKIP)
The standard error of the estimate = standard error of the regression is

$$s = \sqrt{\frac{\sum e_i^2}{n - 2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}}$$

The prediction interval would be calculated as

$$\hat{Y} \pm t_{99} \times s \sqrt{1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2}}$$

where 99 indicates the level of confidence and $X^*$ is the value used in the prediction, i.e. 550.

The standard error of the regression is s = 114.27. The prediction interval is

$$\hat{Y} \pm t_{99} \times s \sqrt{1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2}} = 3476 \pm 2.5 \times 114.27 \sqrt{1 + 0.0192 + \frac{(550 - 391.42)^2}{108046.7}} = 3476 \pm 319.65$$

Thus we can consider with 99% confidence that if the S&P 500 rises to 550, the FTSE 100 will be at 3476 +/- 320, i.e. between 3156 and 3796.
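The prediction-interval arithmetic can be sketched as follows. The function name and argument names are hypothetical; the inputs are the figures quoted in the example, including the critical value 2.5 used there, and n = 52 is assumed from the FTSE/S&P 500 example (1/52 = 0.0192).

```python
import math

def prediction_interval(x_star, alpha_hat, beta_hat, s, n, x_bar, sxx, t_crit):
    """Return (point forecast, half-width) of the prediction interval at X = x_star.

    sxx is the sum of squared deviations of X from its mean.
    """
    y_hat = alpha_hat + beta_hat * x_star
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / sxx)
    return y_hat, half_width

y_hat, half_width = prediction_interval(
    x_star=550.0, alpha_hat=196.3298, beta_hat=5.963972,
    s=114.27, n=52, x_bar=391.42, sxx=108046.7,
    t_crit=2.5,   # critical value as used in the example
)
print(round(y_hat, 2), round(half_width, 2))
```

The half-width grows with the distance of $X^*$ from $\bar{X}$, so forecasts far from the centre of the sample are less precise.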

Spurious regressions: economic and financial time series are usually non-stationary variables (they trend over time and have unit roots).
Regressions with non-stationary variables are not valid (spurious).
There are tests to check for unit roots, the most popular being the ADF (Augmented Dickey-Fuller) and the PP (Phillips-Perron) tests.
Unit roots are eliminated by differencing the variables.
[Figure: two time-series panels, "Non-stationary variable" and "Stationary variable"]

Multiple regression: a regression model incorporating several independent variables is known as a multiple regression, i.e.

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + e$$

The true relationship is unknown and we have to estimate

$$\hat{Y} = \hat{\alpha} + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_n X_n$$

The $\beta$s are the partial derivatives of Y w.r.t. the Xs, i.e.

$$\beta_1 = \frac{\partial Y}{\partial X_1}; \quad \beta_2 = \frac{\partial Y}{\partial X_2}; \quad \beta_n = \frac{\partial Y}{\partial X_n}$$

Computer packages (Eviews, SPSS, RATS, etc.) are used to solve multiple regressions.
Example of results given by software (data in Appendix 6.2, n = 51):

$$\hat{Y} = -0.215 + 0.209 X_1 + 0.934 X_2 - 0.302 X_3$$

t-statistics in parentheses: (-0.39) (1.02) (6.42) (-2.54)
$R^2 = 0.52$; $\bar{R}^2 = 0.49$; DW = 2.3; F = 26.0

The assumptions for the multivariate OLS are the same as for the univariate model.
However, the multivariate model has the additional assumption that the independent variables are independent of each other, i.e. $\mathrm{cov}(x_j, x_k) = 0 \;\; \forall\, j \neq k$

Interpretation of results:

$$\hat{Y} = -0.215 + 0.209 X_1 + 0.934 X_2 - 0.302 X_3$$

t-statistics in parentheses: (-0.39) (1.02) (6.42) (-2.54)
$R^2 = 0.52$; $\bar{R}^2 = 0.49$; DW = 2.3; F = 26.0

The t-statistics for each independent variable are interpreted in exactly the same way as earlier, but the t-distributions have n - 1 - k degrees of freedom, where k = number of independent variables.
If there are k independent variables there will be k + 1 parameters including the constant, thus DF = n - (k + 1) = n - k - 1.
In the example, n = 51, k = 3, DF = 47.
The 95% level of confidence (2-tailed) requires a t-value > 2.02 and the 99% level a t-value > 2.7.
Thus the constant and X1 would not be significant at 95% and X2 would be significant at 99%.

Adjusted $R^2$ ($\bar{R}^2$)
In multivariate regressions, adding additional explanatory variables will cause $R^2$ to increase. Consequently, $R^2$ must be adjusted to take this into account:

$$\bar{R}^2 = 1 - (1 - R^2)\, \frac{n - 1}{n - k - 1}$$

where: n = number of observations; k = number of independent regressors.
Example:

$$\bar{R}^2 = 1 - (1 - 0.52)\, \frac{51 - 1}{47} \approx 0.49$$

(0.489 with the rounded $R^2$ = 0.52; the figure 0.49223 presumably comes from the unrounded $R^2$.)
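A minimal sketch of the adjustment, assuming k counts the independent variables only (so the denominator is n - k - 1 = 47 in the example). The function name is hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 with n observations and k independent variables (constant excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

r2_bar = adjusted_r2(0.52, n=51, k=3)
print(round(r2_bar, 4))
```

Because the input $R^2$ is rounded to two decimals, the result agrees with the reported value only to about two decimal places.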

Test statistic:

$$F = \frac{R^2}{1 - R^2} \cdot \frac{n - k}{k - 1} \sim F_{k-1,\,n-k}$$
$$F = \frac{0.52}{1 - 0.52} \cdot \frac{51 - 3}{3 - 1} = 26$$

The 1% critical value of the F statistic for 2 DF in the numerator and 48 in the denominator is 5.08.
As the decision rule for testing H0 that $R^2 = 0$ is to reject H0 if F > critical value, we reject H0.

Chow test for equality of sub-period coefficients (skip)
1. Run the regression over the complete data series and find $SSE_1$
2. Run the regression over the separate periods and find $SSE_2$ with n observations and $SSE_3$ with m observations
3. Calculate the Chow statistic

$$\frac{(SSE_1 - SSE_2 - SSE_3)/k}{(SSE_2 + SSE_3)/(n + m - 2k)} \sim F_{k,\,m+n-2k}$$
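Step 3 reduces to a one-line calculation once the three sums of squared errors are known. The sketch below uses a hypothetical function name and made-up input values purely to show the arithmetic; k is taken to be the number of estimated parameters in each regression.

```python
def chow_statistic(sse_full, sse_a, sse_b, k, n, m):
    """Chow F statistic from the pooled SSE and the two sub-period SSEs.

    n and m are the sub-period sample sizes; k is the number of parameters.
    """
    numerator = (sse_full - sse_a - sse_b) / k
    denominator = (sse_a + sse_b) / (n + m - 2 * k)
    return numerator / denominator

# Hypothetical values for illustration only.
chow_f = chow_statistic(sse_full=100.0, sse_a=40.0, sse_b=40.0, k=2, n=30, m=30)
print(chow_f)
```

The statistic would then be compared with the F critical value for k and n + m - 2k degrees of freedom.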

Regression analysis Breakdown of the OLS assumptions


Heteroscedasticity Autocorrelation Multicollinearity

Otavio R de Medeiros

50

Heteroscedasticity
If the residuals have a constant variance they are homoscedastic; otherwise they are heteroscedastic.
The effect of heteroscedasticity is that the regression coefficients are no longer the best (minimum variance) estimators.
The consequence of heteroscedasticity for prediction interval estimation and hypothesis testing is that, although the coefficients are unbiased, the variances and standard errors of the coefficients will be biased.
Thus, we may accept the null hypothesis when it should be rejected and vice versa.
Test for heteroscedasticity: the Goldfeld-Quandt test

The Goldfeld-Quandt test (skip)
1. Divide the residuals into 2 groups of n observations, one group with small values and the other with large values
2. The middle one-sixth of observations is removed after sorting in ascending order
3. Compute the test statistic

SSEH GQ ! ~ Fn c ,2 k SSEL


Solution to heteroscedasticity:
Observe the relationship between the error terms and transform the regression model in a way that reflects that relationship.
This may be achieved by regressing the error terms on various functional forms of the variable that causes the heteroscedasticity, e.g.

$$e_i = \alpha + \beta X_i^{\delta}$$

where X = independent variable that is assumed to be the cause of heteroscedasticity; $\delta$ = power of the relationship = 2, 1/n, ...
The variance of the error terms becomes $E(\sigma_i^2) = \sigma^2 X_i^{\delta}$
Thus if $\delta$ = 1/2, we would transform the regression model to

$$\frac{Y_i}{\sqrt{X_i}} = \frac{\alpha}{\sqrt{X_i}} + \beta \sqrt{X_i} + \frac{e_i}{\sqrt{X_i}}$$

If $\delta$ = 2, the transformation would be

$$\frac{Y_i}{X_i} = \frac{\alpha}{X_i} + \beta + \frac{e_i}{X_i}$$

Autocorrelation
Occurs when the residuals are not independent of each other because current values of Y are influenced by past values.
A 1st order autocorrelation process AR(1) is

$$e_t = \rho e_{t-1} + z_t$$

Higher order processes:
AR(2):
$$e_t = \rho_{t-1} e_{t-1} + \rho_{t-2} e_{t-2} + z_t$$
AR(4):
$$e_t = \rho_{t-1} e_{t-1} + \rho_{t-2} e_{t-2} + \rho_{t-3} e_{t-3} + \rho_{t-4} e_{t-4} + z_t$$

Test for 1st order autocorrelation: the Durbin-Watson test

$$DW = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2}$$

To test for autocorrelation we test the following null hypothesis:
H0: no autocorrelation if $d_U \le d \le 4 - d_U$
H1: positive autocorrelation if $d < d_L$; negative autocorrelation if $d > 4 - d_L$
Inconclusive: $d_L < d < d_U$ or $4 - d_U < d < 4 - d_L$
[Figure: number line from 0 to 4 marking $d_L$, $d_U$, 2, $4 - d_U$ and $4 - d_L$]
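The statistic and the decision regions can be sketched in Python as below. The function names are hypothetical, and the bounds $d_L$ and $d_U$ must come from Durbin-Watson tables for the relevant sample size; the values 1.5 and 1.6 in the usage lines are made up for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum of squared successive differences of e_t divided by the sum of e_t^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def dw_decision(d, d_lower, d_upper):
    """Classify d using the regions bounded by dL, dU, 4 - dU and 4 - dL."""
    if d < d_lower:
        return "positive autocorrelation"
    if d > 4.0 - d_lower:
        return "negative autocorrelation"
    if d_upper <= d <= 4.0 - d_upper:
        return "no autocorrelation"
    return "inconclusive"

# Residuals that alternate in sign give a statistic well above 2.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(dw_alternating, dw_decision(dw_alternating, d_lower=1.5, d_upper=1.6))
```

A value near 2 indicates no first-order autocorrelation, values towards 0 indicate positive autocorrelation and values towards 4 negative autocorrelation.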

Example (Brooks)

$$DW = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2}$$

To test for autocorrelation we test the following null hypothesis:
H0: no autocorrelation if $d_U \le d \le 4 - d_U$
H1: positive autocorrelation if $d < d_L$; negative autocorrelation if $d > 4 - d_L$
Inconclusive: $d_L < d < d_U$ or $4 - d_U < d < 4 - d_L$

Autocorrelation may be caused by omitted variables or wrong functional form.
It can also be caused when lagged variables are introduced.
To solve the autocorrelation problem:
- Consider the possibility of, and correct for, omitted variables or wrong functional form
- If this is unsuccessful, use the Cochrane-Orcutt procedure (skip):
  - Calculate the autocorrelation coefficient
    $$\rho = \frac{\sum e_t e_{t-1}}{\sum e_t^2}$$
  - Change the equation to
    $$Y_t - \rho Y_{t-1} = \alpha (1 - \rho) + \beta (X_t - \rho X_{t-1}) + z_t$$
    where $z_t = e_t - \rho e_{t-1}$
This will remove the 1st order autocorrelation from the data.

Multicollinearity
When some or all independent variables in a multiple regression are highly correlated, the regression model has difficulty separating the effects of each X variable on Y.
With multicollinearity the regression coefficients are unreliable: $R^2$ is high but the SEs are also high, so that the coefficients are not significant.
Possible solutions:
- Add more data
- Drop some of the variables that are highly correlated
- Use factor analysis to transform the highly correlated variables into one single factor

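Dummy variables
We use dummy variables or dummies when it is necessary to incorporate one or more qualitative variables into a regression.
- Shift dummy: $Y = \alpha + \beta_1 X + \beta_2 D + e$, compared with $Y = \alpha + \beta_1 X + e$
- Slope dummy: $Y = \alpha + \beta_1 X + \beta_2 DX + e$, compared with $Y = \alpha + \beta_1 X + e$
[Figure: two panels, "Shift dummy" and "Slope dummy", each showing the regression line with and without the dummy]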

Example: Henriksson & Merton (1981):

$$r_p - r_f = \alpha + \beta (r_m - r_f) + c[D(r_m - r_f)] + e$$

$r_p - r_f$ = excess return to the portfolio
$r_m - r_f$ = excess return to the market

Non-linear regression
It may be that the relationship between Y and one or more of the X variables is non-linear.
Two ways of handling this problem are:
- Transform the data and apply linear regression
- Apply non-linear regression techniques

Data transformations
- $Y = \alpha X^{\beta}$ becomes $\ln Y = \ln \alpha + \beta \ln X$
- $Y = \alpha + \beta Z$; $Z = 1/X$
[Figure: panels of non-linear curves, including $Y = \alpha X^{\beta}$ for different ranges of $\beta$]
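The log transformation can be illustrated by fitting a straight line to (ln X, ln Y) and converting the intercept back. This is a sketch only: the function name and the data, which are generated without noise from $Y = 2X^{1.5}$, are hypothetical.

```python
import numpy as np

def fit_power_law(x, y):
    """Estimate alpha and beta in Y = alpha * X**beta via OLS on ln Y = ln alpha + beta ln X."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(intercept), slope      # alpha = exp(ln alpha), beta = slope

# Hypothetical noise-free data, so the parameters are recovered exactly.
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 2.0 * x ** 1.5
alpha_est, beta_est = fit_power_law(x, y)
print(alpha_est, beta_est)
```

The transformation requires X and Y to be strictly positive, and with noisy data the least-squares fit is optimal in logs rather than in levels.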