
CORRELATION

Correlation: summarizes the relationship between two data series in a single number (a measure of linear relationship). Covariance: measures how one random variable varies with another random variable.

Properties of Covariance: Cov(X, Y) = Cov(Y, X); Cov(X, X) = Var(X)

Sample covariance

Sample correlation coefficient
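In standard textbook notation (n paired observations), the two sample statistics named above are:

```latex
\text{Sample covariance:}\quad
s_{XY} = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n-1}

\text{Sample correlation coefficient:}\quad
r = \frac{s_{XY}}{s_X\, s_Y}
```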

Limitations:
1. A measure of linear association only
2. An unreliable measure when there are outliers
3. Correlation does not imply causation
4. Correlations may be spurious:
   a. May reflect chance relationships
   b. The two variables may be related through a third variable

Test of the hypothesis
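The usual test of H0: ρ = 0 uses a t-statistic with n − 2 degrees of freedom:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
```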

LINEAR REGRESSION
Makes predictions about a dependent variable, Y (also called the explained, endogenous, or predicted variable), using an independent variable, X (also called the explanatory, exogenous, or predicting variable).

Regression model equation

Regression line equation

The line of best fit minimizes the sum of the squared regression residuals.

Sum of Squared Errors (SSE)
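In standard notation (hats denote OLS estimates), the three items above take the form:

```latex
\text{Regression model:}\quad Y_i = b_0 + b_1 X_i + \varepsilon_i

\text{Regression line:}\quad \hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i

\text{SSE} = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^{2}
```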

Assumptions
1. The relationship between Y and X is linear in its parameters
2. The Xs are not random, and no exact linear relationship exists between them
3. The expected value of the error, conditional on the independent variables, is zero
4. The variance of the error is constant
5. The error terms are uncorrelated
6. The error term is normally distributed

Standard Error of Estimate Used to measure how well a regression model captures the relationship between the two variables. It measures the standard deviation of the residual term in the regression.
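For a regression with one independent variable, the SEE is the square root of the mean squared error:

```latex
\text{SEE} = \sqrt{\frac{\text{SSE}}{n-2}} = \sqrt{\text{MSE}}
```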

The Coefficient of Determination and Adjusted Coefficient of Determination
Measure how well the independent variable explains the variation in the dependent variable. It's calculated in two ways:
1. Coefficient of Determination
2. Adjusted Coefficient of Determination
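In standard notation (k = number of independent variables), the two versions are:

```latex
R^{2} = \frac{\text{RSS}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}

\bar{R}^{2} = 1 - \left(\frac{n-1}{n-k-1}\right)\left(1 - R^{2}\right)
```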

Hypothesis Testing for Coefficients
F-Test Statistic (tests the null that all slope coefficients jointly equal zero)

T-Test Statistic

Confidence interval
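In standard notation, the three statistics above are:

```latex
F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{RSS}/k}{\text{SSE}/(n-k-1)}

t = \frac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}

\text{Confidence interval:}\quad \hat{b}_1 \pm t_{\alpha/2}\, s_{\hat{b}_1}
```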

Important Points
o Increasing the significance level increases the probability of a Type I error.
o Increasing the significance level decreases the probability of a Type II error.
o The p-value is the lowest level of significance at which the null hypothesis can be rejected.
o The smaller the standard error of an estimated parameter, the stronger the results of the regression and the narrower the resulting confidence intervals.

ANOVA TABLE
Source of Variation        Degrees of Freedom    Sum of Squares    Mean Sum of Squares
Regression (explained)     k                     RSS               MSR = RSS / k
Error (unexplained)        n - k - 1             SSE               MSE = SSE / (n - k - 1)
Total                      n - 1                 SST

Prediction Intervals
Create a confidence interval for the predicted value of the dependent variable. There are two sources of uncertainty:
o The uncertainty inherent in the error term.
o The uncertainty in the estimated parameters of the regression.

Limitations:
1. The relationships can change over time (parameter instability)
2. Public knowledge of regression relationships may negate their usefulness
3. If the assumptions of regression analysis do not hold, predictions based on the model will not be valid

DUMMY VARIABLES
Must be binary in nature. For n categories, use n - 1 dummy variables. The intercept term in the regression indicates the average value of the dependent variable for the omitted category. T-stats are used to test whether the value of the dependent variable in each category differs from its value in the omitted category.

VIOLATIONS OF REGRESSION ANALYSIS

HETEROSKEDASTICITY
Definition: The variance of the error term in the regression is not constant.
Effects:
o Does not affect the consistency of estimators of the regression parameters.
o Can lead to mistakes in inferences made from parameter estimates.
o The F-test and t-tests become unreliable: standard errors of regression coefficients are underestimated, so t-stats are inflated.
Types:

o Unconditional heteroskedasticity: the variance of the error term is not related to the independent variables in the regression. Does not create problems.
o Conditional heteroskedasticity: the variance is correlated with the independent variables. Does create problems.

Testing: The Breusch-Pagan (BP) test
o Regress the squared residuals from the original estimated regression equation on the independent variables in the regression.

o A one-tailed chi-squared test, because conditional heteroskedasticity is only a problem if it is too large.
Correcting Heteroskedasticity
o Use robust standard errors (White-corrected standard errors)
o Use generalized least squares with a modified regression equation
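A minimal NumPy sketch of the BP procedure (the helper names and toy data below are my own, not from these notes): regress the squared OLS residuals on the original independent variables and compare the statistic n·R² against a chi-squared critical value with k degrees of freedom.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (X must include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def breusch_pagan_lm(y, X):
    """BP LM statistic = n * R^2 from regressing squared OLS residuals on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return len(y) * r_squared(resid ** 2, X)

# Toy data with conditional heteroskedasticity: error variance grows with x.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + np.exp(x / 2) * rng.normal(size=n)
lm = breusch_pagan_lm(y, X)
# One-tailed test: with k = 1 independent variable, the 5% chi-squared
# critical value is 3.84; lm above that rejects homoskedasticity.
```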

SERIAL CORRELATION
Definition: Regression errors are correlated across observations.
Effects:
o Does not affect the consistency of the estimated regression coefficients unless one of the independent variables in the regression is a lagged value of the dependent variable.
o When a lagged value of the dependent variable is not an independent variable in the regression, positive (negative) serial correlation:
  - Does not affect the consistency of the estimated regression coefficients.
  - Causes the F-stat to be inflated (deflated) because MSE will tend to underestimate (overestimate) the population error variance.
  - Causes the standard errors for the regression coefficients to be underestimated (overestimated), which results in larger (smaller) t-values.
Types:

o Positive serial correlation: a positive (negative) error for one observation increases the chances of a positive (negative) error for another.
o Negative serial correlation: a positive (negative) error for one observation increases the chances of a negative (positive) error for another.

Testing: The Durbin-Watson (DW) Test

r is the sample correlation between residuals from one period and residuals from the previous period. First determine whether the regression residuals are positively or negatively serially correlated.
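The DW statistic itself (its formula is not shown in these notes) is computed from the regression residuals and, in large samples, is approximately 2(1 − r):

```latex
\text{DW} = \frac{\sum_{t=2}^{n}\left(\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1}\right)^{2}}{\sum_{t=1}^{n}\hat{\varepsilon}_t^{2}} \approx 2(1 - r)
```

A value near 2 indicates no serial correlation; values toward 0 indicate positive serial correlation, and values toward 4 indicate negative serial correlation.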

Correcting Serial Correlation
o Hansen's method (also corrects for heteroskedasticity). The regression coefficients remain the same, but the standard errors change.
o Modify the regression equation to eliminate the serial correlation.

MULTICOLLINEARITY
Definition: Two or more independent variables in a regression model are highly (but not perfectly) correlated with each other.
Effects:
o Does not affect the consistency of OLS estimates of the regression coefficients, but makes them inaccurate and unreliable.
o Difficult to isolate the impact of each independent variable on the dependent variable.
o Standard errors for the regression coefficients are inflated, so t-stats are too small and less powerful.
Testing:
o A high R² and a significant F-stat coupled with insignificant t-stats
Correcting for Multicollinearity
o Stepwise regression: exclude one or more of the independent variables from the regression
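The R²/F/t symptom above can be supplemented with the variance inflation factor (VIF), a standard diagnostic not named in these notes. A minimal NumPy sketch, with hypothetical data in which two regressors are nearly collinear:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on the remaining columns plus an intercept.
    X should NOT contain an intercept column."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)               # unrelated regressor
vifs = vif(np.column_stack([x1, x2, x3]))
# x1 and x2 should show very large VIFs; x3 should be near 1.
# A common rule of thumb flags VIF > 10 as serious multicollinearity.
```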

MISSPECIFIED MODELS
1. Misspecified Functional Form
   a. One or more important variables may have been omitted from the regression
   b. A wrong form of the data may be used
   c. The model may pool data from different sources
2. Time Series Misspecification
   a. Including lagged dependent variables as independent variables in regressions with serially correlated errors
   b. Using the regression to forecast the dependent variable at time t+1 based on independent variables that are a function of the dependent variable at time t
   c. Independent variables measured with error
   d. Nonstationary variables: variables whose properties are not constant over time

QUALITATIVE DEPENDENT VARIABLES
A dummy variable used as a dependent variable instead of as an independent variable.
o Probit model: based on the normal distribution (estimated by maximum likelihood)
o Logit model: based on the logistic distribution (estimated by maximum likelihood)
o Discriminant analysis: a linear function used to create an overall score on the basis of which an observation can be classified qualitatively.
Must also watch out for heteroskedasticity, serial correlation, and multicollinearity in the regression.

TIME SERIES
LINEAR TREND MODELS The dependent variable changes by a constant amount in each period. Ordinary least squares (OLS) regression is used to estimate the regression coefficients. Yt grows by a constant amount each period.

LOG-LINEAR TREND MODELS Used for a time series that exhibits exponential growth. Yt grows by a constant growth rate.

AUTOREGRESSIVE (AR) MODEL A time series regressed on its own past values.

Covariance Stationary Series
To conduct statistical inference we must assume that the time series is covariance stationary (weakly stationary). Requirements:
1. The expected value (mean) of the time series must be constant and finite in all periods.
2. The variance of the time series must be constant and finite in all periods.
3. The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
If not covariance stationary: biased estimates and spurious results.

Detecting Serially Correlated Errors in an AR Model
The AR model can be estimated using ordinary least squares if:
1. The time series is covariance stationary
2. The errors are uncorrelated
Test whether the autocorrelations of the error terms are significantly different from 0:
o Significantly different from 0: the errors are serially correlated
o Not significantly different from 0: the errors are not serially correlated

Three steps:
1. Estimate a particular AR model.
2. Compute the autocorrelations of the residuals from the model.
3. Determine whether the residual autocorrelations significantly differ from 0.
The Durbin-Watson test cannot be used to test for serial correlation in an AR model because the independent variables include past values of the dependent variable.

Mean Reversion
A time series is mean reverting if Y tends to fall when its current level is above the mean and tends to rise when its current level is below the mean.
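For an AR(1) model, the mean-reverting level follows from setting the current and lagged values equal:

```latex
x_t = b_0 + b_1 x_{t-1}
\quad\Rightarrow\quad
\text{mean-reverting level} = \frac{b_0}{1 - b_1} \qquad (b_1 \neq 1)
```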

All covariance stationary time series have a finite mean-reverting level.

Comparing Forecast Model Performance
Compare their standard errors:
o In-sample forecast errors: the model with the smaller standard error is more accurate.
o Out-of-sample forecasts: evaluated on the basis of root mean squared error (RMSE). The model with the lowest RMSE has the most predictive power.

RANDOM WALKS

The error has a constant variance and is uncorrelated with its value in previous periods. A random walk is a special case of the AR(1) model with b0 = 0 and b1 = 1. Standard regression analysis cannot be applied because random walks:
o Do not have a finite mean-reverting level.
o Do not have a finite variance.
First differencing: converts a random walk into a covariance stationary time series.

Model the change in the value rather than the value of the variable itself.

RANDOM WALK WITH A DRIFT
A time series that increases or decreases by a constant amount in each period.

Has an undefined mean-reverting level (because b1 = 1).

The Unit Root Test of Nonstationarity
Ways to determine whether a time series is covariance stationary:
a. Examine the autocorrelations of the time series at various lags: if the autocorrelations at all lags do not significantly differ from 0, the series is stationary.
b. Conduct the Dickey-Fuller test for a unit root (the preferred approach). A time series that has a unit root is a random walk.
   i. H0 for the Dickey-Fuller test: g1 = 0 (non-stationary)
   ii. HA for the Dickey-Fuller test: g1 < 0 (covariance stationary)
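A minimal NumPy sketch of the Dickey-Fuller idea (no drift term; function names and simulated data are my own). The statistic has a nonstandard distribution, so it must be compared against Dickey-Fuller tables, not Student's t:

```python
import numpy as np

def df_tstat(y):
    """t-statistic on g in the Dickey-Fuller regression
    Δy_t = g * y_{t-1} + ε_t (no drift). H0: g = 0 (unit root).
    Compare against Dickey-Fuller critical values, not Student's t."""
    dy = np.diff(y)
    ylag = y[:-1]
    g = (ylag @ dy) / (ylag @ ylag)
    resid = dy - g * ylag
    s2 = (resid @ resid) / (len(dy) - 1)
    return g / np.sqrt(s2 / (ylag @ ylag))

rng = np.random.default_rng(7)
n = 1000
walk = np.cumsum(rng.normal(size=n))   # random walk: has a unit root
ar = np.zeros(n)                       # stationary AR(1) with b1 = 0.5
for t in range(1, n):
    ar[t] = 0.5 * ar[t - 1] + rng.normal()

t_walk = df_tstat(walk)
t_ar = df_tstat(ar)
# The stationary series produces a strongly negative statistic
# (reject the unit-root null); the random walk does not.
```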

MOVING-AVERAGE TIME SERIES MODELS
Smoothing Past Values with Moving Averages
Weakness: it always lags large movements in the underlying data and does not hold much predictive value.
Moving-Average Time Series Models for Forecasting

Seasonality in Time Series Models Examine the autocorrelations of the residuals to determine whether the seasonal autocorrelation of the error term is significantly different from 0 To correct for seasonality, add a seasonal lag to the AR model

AUTOREGRESSIVE MOVING AVERAGE (ARMA)
Combines autoregressive lags of the dependent variable and moving-average errors in order to provide better forecasts than simple AR models: ARMA(p, q)

Limitations: 1. The parameters of the model can be very unstable. 2. There are no set criteria for determining p and q 3. It may not do a good job of forecasting.

AUTOREGRESSIVE CONDITIONAL HETEROSKEDASTICITY MODEL (ARCH)
Used to determine whether the variance of the error in one period depends on the variance of the error in previous periods. ARCH(1) model: the squared residuals from a particular time series model (which may be an AR, MA, or ARMA model) are regressed on a constant and on one lag of the squared residuals. The regression equation takes the following form:

If a1 = 0, the variance of the error is constant and the model can be used. If a1 ≠ 0, the variance of the error in period t+1 can be predicted by:
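In standard notation, the ARCH(1) regression and the one-period-ahead variance forecast are:

```latex
\hat{\varepsilon}_t^{2} = a_0 + a_1 \hat{\varepsilon}_{t-1}^{2} + u_t

\hat{\sigma}_{t+1}^{2} = \hat{a}_0 + \hat{a}_1 \hat{\varepsilon}_t^{2}
```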

Regressions with More than One Time Series
Linear regression can be used to analyze the relationship between more than one time series. Possible scenarios:
1. If neither of the time series has a unit root, linear regression can be used.
2. If only one of the series has a unit root, linear regression cannot be used.
3. If both series have unit roots, determine whether they are cointegrated (a long-term economic relationship exists between the variables):
   i. If not cointegrated, linear regression cannot be used.
   ii. If cointegrated, linear regression can be used.

Testing for Cointegration
Steps:
1. Estimate the regression.
2. Test whether the error term has a unit root using the Dickey-Fuller test, but with Engle-Granger critical values.
3. H0: the error term has a unit root vs. HA: the error term does not have a unit root.
4. If we fail to reject the null hypothesis, the time series are not cointegrated and the regression relation is spurious.
5. If we reject the null hypothesis, the error term is covariance stationary and the time series are cointegrated.
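The steps above can be sketched as follows (NumPy only; the data are simulated and the critical-value comment is an approximate large-sample Engle-Granger value that should be checked against a table):

```python
import numpy as np

def engle_granger_stat(y, x):
    """Engle-Granger steps 1-2: regress y on x (with intercept), then run
    the Dickey-Fuller regression Δe_t = g * e_{t-1} + u_t on the residuals.
    Returns the DF t-statistic on g; compare it to Engle-Granger critical
    values (roughly -3.34 at 5% for two series - an assumption to verify),
    not to Student's t."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta          # step 1: residuals of the levels regression
    de = np.diff(e)
    elag = e[:-1]
    g = (elag @ de) / (elag @ elag)
    resid = de - g * elag
    s2 = (resid @ resid) / (len(de) - 1)
    return g / np.sqrt(s2 / (elag @ elag))

rng = np.random.default_rng(1)
n = 1000
x = np.cumsum(rng.normal(size=n))        # unit-root series
y = 2.0 + 0.5 * x + rng.normal(size=n)   # cointegrated with x: stationary error
stat = engle_granger_stat(y, x)
# A strongly negative statistic rejects the unit-root null for the
# residuals, i.e. the two series are cointegrated.
```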