
CORRELATION & REGRESSION ANALYSIS
Measure of Association

 Measure of Association is a statistical technique used to measure the strength of a relationship between two variables.

 Correlation analysis is most appropriate for interval or ratio variables. Regression can accommodate less-than-interval independent variables, but the dependent variable must be continuous.

 The chi-square (χ2) test provides information about whether two or more less-than
interval variables are interrelated.

 Correlation can be thought of as a standardized covariance.

 Covariance coefficients retain information about the absolute scale ranges, so the strength of association for scales with different possible values cannot be compared directly. This problem does not arise with correlation because it is a standardized covariance and hence independent of the units of measurement.
 The correlation coefficient, r, ranges from –1.0 to +1.0.

 If the value of r = +1.0, a perfect positive relationship exists. Perhaps the two
variables are one and the same!

 If the value of r = –1.0, a perfect negative relationship exists. The implication is that one variable is a mirror image of the other. As one goes up, the other goes down in proportion, and vice versa.

 No correlation is indicated if r = 0.

 A correlation coefficient indicates both the magnitude of the linear relationship and the direction of that relationship. For example, if we find that r = –0.92, we know we have a very strong inverse relationship; that is, the greater the value measured by variable X, the lower the value measured by variable Y.
Correlation Matrix
 A correlation matrix is the standard form for reporting observed correlations among multiple variables. Each entry represents the bivariate relationship between a pair of variables.

The main diagonal consists of correlations of 1.00. Also REMEMBER that correlations should always be considered together with their significance levels (p-values). If a correlation is not significant, it is of little use.
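To see what such a matrix looks like in practice, here is a minimal Python sketch (NumPy, pandas and SciPy assumed available; the variable names and simulated data are invented purely for illustration):

```python
# Minimal sketch: a correlation matrix plus p-values for each pairwise correlation.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 50
ad_spend = rng.normal(100, 20, n)                      # hypothetical variables
price = rng.normal(10, 2, n)
sales = 5 * ad_spend - 30 * price + rng.normal(0, 100, n)

df = pd.DataFrame({"sales": sales, "ad_spend": ad_spend, "price": price})

# Correlation matrix: the main diagonal is 1.00 by construction.
print(df.corr().round(3))

# Significance (p-value) of each pairwise Pearson correlation.
for a in df.columns:
    for b in df.columns:
        if a < b:
            r, p = stats.pearsonr(df[a], df[b])
            print(f"r({a}, {b}) = {r:.3f}, p = {p:.4f}")
```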
Coefficient of Determination (R2)

 The proportion of variance in Y that is explained by X (or vice versa) can be calculated through the coefficient of determination (R2). The coefficient of determination, R2, measures that part of the total variance of Y that is accounted for by knowing the value of X.

 R2 = Explained variance / Total variance = SSR/ SST

 If the correlation between unemployment and hours worked is r = –0.635, then R2 = 0.403. About 40 percent of the variance in unemployment can be explained by the variance in hours worked, and vice versa.

 Thus, R-squared really is just r squared!


Decomposition of the Total Variation

R2 = Explained variation / Total variation = SSx / SSy

R2 = (Total variation – Error variation) / Total variation = (SSy – SSerror) / SSy
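As a quick check of these identities, the following minimal Python sketch (NumPy only, simulated x and y data) computes r squared, the explained variation and the error variation, and confirms that all three expressions for R2 agree:

```python
# Minimal sketch: R-squared as a squared correlation and as a variance decomposition.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)

r = np.corrcoef(x, y)[0, 1]

# Fit the least-squares line y_hat = a + b*x (np.polyfit returns slope first).
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)      # total variation (SSy)
ss_error = np.sum((y - y_hat) ** 2)         # error (unexplained) variation
ss_expl = np.sum((y_hat - y.mean()) ** 2)   # explained variation

print("r^2             :", round(r ** 2, 4))
print("explained / SSy :", round(ss_expl / ss_total, 4))
print("(SSy - SSe)/SSy :", round((ss_total - ss_error) / ss_total, 4))
```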
Regression Analysis
 Regression analysis is a technique for measuring the linear
association between a dependent and an independent variable.

 Regression is a dependence technique, whereas correlation is an interdependence technique.

 With simple regression, a dependent (or criterion) variable, Y, is linked to an independent (or predictor) variable, X. Regression analysis attempts to predict the values of a continuous, interval-scaled dependent variable from specific values of the independent variable.
Simple Linear Regression
 Suppose there exists one independent variable (x) and a dependent variable (y); then the relationship between y and x can be denoted as y = f(x).
 This relationship has both a deterministic component and a random (stochastic) component.

 If we now assume that f(x) is linear in x then

f(x) = α + βx

 Thus we can write y = α + βx + u, where α + βx is the deterministic component of y and u is the stochastic or random component.

 α and β are called regression coefficients that we ESTIMATE from the data on y and x.
 If we have n observations on y and x then we can write the simple linear regression as:
yi = α + βxi + ui where i = 1, 2, 3, ……, n.

 The objective is to obtain estimated values for the unknown parameters α and β given that there are n observations on y and x.

 In the above simple linear regression equation α represents the Y intercept (where the line crosses the Y-axis) and β is the slope coefficient. The slope is the change in Y associated with a change of one unit in X.
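A minimal Python sketch of the least-squares estimates of α and β, using simulated data with known true values (NumPy assumed available; the numbers are illustrative only):

```python
# Minimal sketch: estimating alpha (intercept) and beta (slope) from n observations.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
u = rng.normal(0, 1, n)              # stochastic (random) component
y = 3.0 + 0.8 * x + u                # true alpha = 3.0, true beta = 0.8

# Closed-form least-squares estimates.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

print("alpha_hat:", round(alpha_hat, 3))   # estimated Y intercept
print("beta_hat :", round(beta_hat, 3))    # estimated change in Y per unit change in X
```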
Estimate the Standardized Regression Coefficient and Test of
Significance

 Standardization is the process by which the raw data are transformed into new variables that
have a mean of 0 and a variance of 1.
 When the data are standardized, the intercept assumes a value of 0.

 There is a simple relationship between the standardized and non-standardized regression coefficients:

Byx = byx (Sx / Sy)

 The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:

H0: b1 = 0
H1: b1 ≠ 0

A t statistic with n – 2 degrees of freedom can be used, where t = b / SEb.

 SEb denotes the standard deviation of b and is called the standard error.
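The sketch below (Python with NumPy/SciPy, simulated data with invented values) illustrates both the Byx = byx (Sx / Sy) relationship and the t test of H0: b1 = 0:

```python
# Minimal sketch: standardized vs. raw slope, and the t test for H0: b = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
x = rng.normal(50, 10, n)
y = 4.0 + 0.6 * x + rng.normal(0, 5, n)

# Raw (unstandardized) slope and intercept.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Standardized coefficient: B = b * (Sx / Sy); in simple regression this equals r.
B = b * (x.std(ddof=1) / y.std(ddof=1))

# Standard error of b and the t statistic with n - 2 degrees of freedom.
resid = y - (a + b * x)
se_b = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum((x - x.mean()) ** 2))
t = b / se_b
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"b = {b:.3f}, B (standardized) = {B:.3f}, t = {t:.2f}, p = {p:.4f}")
```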
Assumptions of Regression

1) Zero Mean. E(ui) = 0 for all observations ‘i’.

2) Common variance. Var(ui) = σ2 for all ‘i’. This implies homoscedasticity.

3) Independence. ui and uj are independent for all i ≠ j. This implies no auto- or serial correlation. We should worry about the systematic effect of X on Y, not about inter-correlation between the ui and uj.

4) Independence of xi. ui and xi are independent for all i. This implies that x and u have separate and additive influences on Y, so their individual effects on Y can be assessed.

5) Normality. ui are normally distributed for all ‘i’: ui ~ IN(0, σ2). This implies that the ui are independently and normally distributed with a mean of 0 and common variance σ2.
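A small simulation can make these assumptions concrete. The sketch below (Python/NumPy, with invented parameter values) generates disturbances drawn as IN(0, σ2), independent of x, and then inspects the fitted residuals:

```python
# Minimal sketch: disturbances simulated as IN(0, sigma^2), checked via OLS residuals.
import numpy as np

rng = np.random.default_rng(4)
n, sigma = 500, 2.0
x = rng.uniform(0, 10, n)
u = rng.normal(0, sigma, n)          # zero mean, common variance, independent of x
y = 1.0 + 0.5 * x + u

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

# With an intercept, OLS residuals have mean 0 and zero correlation with x by construction;
# the estimated error variance should be close to sigma^2 = 4.
print("mean of residuals      :", round(resid.mean(), 6))
print("estimated error variance:", round(resid.var(ddof=2), 3))
print("corr(residuals, x)     :", round(np.corrcoef(resid, x)[0, 1], 6))
```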
Sources of Error Term (u)
1) Unpredictable element of randomness in human response – If y = consumption expenditure of a household and x = disposable income, there is an unpredictable element of randomness in each household’s consumption. The household does not behave like a machine. In one month the people in the household may be on a spending spree, while they may be tightfisted the next month.

2) Effect of the large number of variables that have been omitted – Disposable income is not the only variable influencing consumption expenditure in the above example. Family size, tastes of the family and spending habits may also influence consumption expenditure.

3) Measurement error in y – In this example the measurement error is in consumption expenditure, i.e., we cannot measure consumption expenditure accurately.
Parameter Estimate Choices
 In most business research, the estimate of β is most important. The explanatory power of regression rests with β because this is where the direction and strength of the relationship between the independent and dependent variables are expressed.

 An intercept term (α) is sometimes referred to as a constant because α represents a fixed point.

 Parameter estimates can be presented in either raw or standardized form.

 One potential problem with raw parameter estimates is that they reflect the measurement scale range. So, if a simple regression involved distance measured in miles, very small parameter estimates may indicate a strong relationship. In contrast, if the very same distance were measured in centimeters, a very large parameter estimate would be needed to indicate a strong relationship.
Standardized β Coefficient
 Standardized β coefficients are estimated coefficients indicating the strength of relationship, expressed on a standardized scale where higher absolute values indicate stronger relationships (the range is from –1 to +1).

 A standardized regression coefficient provides a common metric, allowing regression results to be compared to one another no matter what the original scale range may have been.

 Raw (unstandardized) regression weights have the advantage of retaining the original scale metric.

 Standardized β coefficients should be reported when the researcher is testing an explanation rather than making a prediction; unstandardized weights are used when the researcher makes predictions.
Conducting Bivariate Regression Analysis: Determine the Strength and
Significance of Association

• A test for examining the significance of the linear relationship between X and Y (significance of b) is the
test for the significance of the coefficient of determination. The hypotheses in this case are:

H0: R2pop = 0

H1: R2pop > 0

The appropriate test statistic is the F statistic:

F = SSreg / [SSres / (n – 2)]

which has an F distribution with 1 and n - 2 degrees of freedom. The F test is a generalized form of the t test
(see Chapter 15). If a random variable is t distributed with n degrees of freedom, then t2 is F distributed
with 1 and n degrees of freedom.

Equivalently, the hypotheses can be stated in terms of the slope or the population correlation coefficient:

H0: b1 = 0
H1: b1 ≠ 0

H0: ρ = 0
H1: ρ ≠ 0
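The following Python sketch (NumPy/SciPy, simulated data) computes this F statistic for a bivariate regression and confirms that it equals t² for the slope:

```python
# Minimal sketch: overall F test for a bivariate regression, and the F = t^2 identity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x = rng.normal(0, 1, n)
y = 1.0 + 0.7 * x + rng.normal(0, 1, n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)

F = ss_reg / (ss_res / (n - 2))              # 1 and n - 2 degrees of freedom
p_F = stats.f.sf(F, 1, n - 2)

se_b = np.sqrt(ss_res / (n - 2) / np.sum((x - x.mean()) ** 2))
t = b / se_b

print(f"F = {F:.3f}, p = {p_F:.4f}")
print(f"t^2 = {t**2:.3f}  (matches F in bivariate regression)")
```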
Multiple Regression
Y = β0 + β1X1 + β2X2 + . . . + βnXn

Y = Dependent Variable = # of credit cards
β0 = intercept (constant) = constant number of credit cards independent of family size and income.
β1 = change in # of credit cards associated with a unit change in family size (regression coefficient).
β2 = change in # of credit cards associated with a unit change in income (regression coefficient).
X1 = family size
X2 = income
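A minimal Python sketch of this model; the family-size, income and credit-card figures below are invented purely to illustrate the mechanics (NumPy assumed available):

```python
# Minimal sketch: multiple regression of # of credit cards on family size and income.
import numpy as np

family_size = np.array([2, 2, 4, 4, 5, 5, 6, 6], dtype=float)   # X1
income      = np.array([14, 16, 14, 17, 18, 21, 17, 25], dtype=float)  # X2 (illustrative units)
cards       = np.array([4, 3, 4, 5, 6, 7, 8, 8], dtype=float)   # Y

# Design matrix with a column of 1s for the intercept (beta_0).
X = np.column_stack([np.ones_like(cards), family_size, income])
beta, *_ = np.linalg.lstsq(X, cards, rcond=None)

b0, b1, b2 = beta
print(f"Y_hat = {b0:.2f} + {b1:.2f}*family_size + {b2:.2f}*income")
```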
Sample Size Considerations
• The minimum ratio of observations to variables is 10 to 1,
but the preferred ratio is 15 or 20 to 1, and this should
increase when stepwise estimation is used.
Regression Analysis Terms
• Explained variance = R2 (coefficient of determination).

• Unexplained variance = residuals (error).

• Adjusted R-Square = reduces the R2 by taking into account the sample size and the number of
independent variables in the regression model (It becomes smaller as we have fewer observations per
independent variable).

• Standard Error of the Estimate (SEE) is a measure of the accuracy of the regression predictions. It
estimates the variation of the dependent variable values around the regression line. It should get smaller
as we add more independent variables, if they predict well. When the SEE is very small one would
therefore expect to see that most of the observed values cluster fairly close to the regression line and vice
versa.

• Standard error of the regression coefficient – the standard error of the beta coefficient gives us an indication of how much the point estimate is likely to vary from the corresponding population parameter. It measures the amount of sampling error.
 
Figure 1. Low S.E. estimate – predicted Y values close to the regression line.

Figure 2. Large S.E. estimate – predicted Y values scattered widely above and below the regression line.
• Total Sum of Squares (SST) = total amount of variation that exists to be explained by the independent variables. SST = the sum of SSE and SSR.

• Sum of Squared Errors (SSE) = the variance in the dependent variable not
accounted for by the regression model = residual. The objective is to obtain
the smallest possible sum of squared errors as a measure of prediction
accuracy.

• Sum of Squares Regression (SSR) = the amount of improvement in explanation of the dependent variable attributable to the independent variables.
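The sketch below (Python/NumPy, simulated data) computes SST, SSR, SSE and the SEE for a simple regression and verifies that SST = SSR + SSE:

```python
# Minimal sketch: decomposing total variation (SST = SSR + SSE) and computing the SEE.
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.uniform(0, 10, n)
y = 2.0 + 1.2 * x + rng.normal(0, 2, n)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)          # error (residual) sum of squares

see = np.sqrt(sse / (n - 2))            # standard error of the estimate (one predictor)

print("SST:", round(sst, 2), " SSR + SSE:", round(ssr + sse, 2))
print("R^2:", round(ssr / sst, 3), " SEE:", round(see, 3))
```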
Statistical vs. Practical Significance?

• The F statistic is used to determine if the overall regression model is statistically significant.

• A large R2 indicates the straight line works well, while a small R2 indicates it does not work well.

• We also must ask whether the results are meaningful. For example, is the value of knowing you have explained 4 percent of the variation worth the cost of collecting and analyzing the data?
VIOLATIONS OF REGRESSION ASSUMPTIONS
Heteroskedasticity
The second assumption of regression is common variance, i.e., Var(ui) = σ2 for all ‘i’. This assumption is also known as homoskedasticity [equal (homo) spread (skedasticity)].

Violation of this second assumption of regression is called heteroskedasticity, i.e., the errors do not have a constant/common variance.

In Fig 1, the conditional variance of Yi (which is equal to that of ui), conditional upon the given Xi, remains the same regardless of the values taken by the variable X. In contrast, Fig 2 shows the conditional variance of Yi increasing as X increases; here the variances of Yi are not the same. Hence, there is heteroskedasticity.
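One common diagnostic, not covered in these notes, is the Breusch–Pagan test. The sketch below (Python with statsmodels, simulated data whose error spread deliberately grows with X) is a minimal illustration under those assumptions:

```python
# Minimal sketch: simulating heteroskedastic errors and applying the Breusch-Pagan test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(1, 10, n)
u = rng.normal(0, 0.5 * x)            # error spread grows with x: heteroskedastic
y = 2.0 + 0.8 * x + u

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")   # small p-value -> heteroskedasticity
```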
Consequences of Heteroskedasticity

1) The OLS estimators have large variances, making precise estimation difficult.

2) Wider confidence intervals, leading to the acceptance of the “zero null hypothesis” (i.e., that the true population coefficient is zero) more readily.

3) Also because of consequence 1, the t ratio of one or more coefficients tends to be statistically insignificant even though in actuality they are significant.
Multicollinearity
 The situation where the explanatory variables are highly inter-
correlated is referred to as multicollinearity.

 When the explanatory variables are highly correlated, it becomes difficult to disentangle the separate effects of each of the explanatory variables on the explained variable.

 If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite.

 If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.
Multicollinearity Diagnostic
Variance Inflation Factor (VIF) – measures how much the variance of the regression coefficients is inflated by multicollinearity problems. A VIF of 1 indicates no correlation between that independent variable and the other predictors; values somewhat above 1 indicate some association between predictor variables, but generally not enough to cause problems. A maximum acceptable VIF value would be 10; anything higher would indicate a problem with multicollinearity.

Tolerance – the amount of variance in an independent variable that is not explained by the other independent variables. If the other variables explain a lot of the variance of a particular independent variable, we have a problem with multicollinearity. Thus, small values for tolerance indicate problems of multicollinearity. The cutoff value for tolerance is typically .10: a tolerance value smaller than .10 indicates a problem of multicollinearity.
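A minimal Python sketch (statsmodels assumed available; data simulated with two deliberately correlated predictors) computing VIF and tolerance for each predictor:

```python
# Minimal sketch: VIF and tolerance for each predictor (tolerance = 1 / VIF).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + rng.normal(0, 0.3, n)   # deliberately correlated with x1
x3 = rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
names = ["const", "x1", "x2", "x3"]

for i, name in enumerate(names):
    if name == "const":
        continue
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1.0 / vif:.3f}")
```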
Consequences of Multicollinearity
1. Although BLUE, the OLS estimators have large variances and covariances, making
precise estimation difficult.

2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the acceptance of the “zero null hypothesis” (i.e., the true population coefficient is zero) more readily.

3. Also because of consequence 1, the t ratio of one or more coefficients tends to be statistically insignificant.

4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the
overall measure of goodness of fit, can be very high.

5. The OLS estimators and their standard errors can be sensitive to small changes in the
data.
Autocorrelation
 The term autocorrelation may be defined as “correlation between members of
series of observations ordered in time [as in time series data] or space [as in
cross-sectional data].”

 The third assumption of regression implies independence, i.e., ui and uj are independent for all i ≠ j, i.e., E(ui uj) = 0 for all i ≠ j.

 However, under autocorrelation E(ui uj) ≠ 0 for i ≠ j.

 Difference between auto- and serial correlation – correlation between two time series such as u1, u2, ..., u10 and u2, u3, ..., u11, where the former is the latter series lagged by one time period, is autocorrelation, whereas correlation between time series such as u1, u2, ..., u10 and v2, v3, ..., v11, where u and v are two different time series, is called serial correlation.
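A common diagnostic for first-order autocorrelation is the Durbin–Watson statistic (not discussed in these notes). The sketch below (Python with statsmodels, simulated AR(1) disturbances) shows how it flags the problem; values near 2 suggest no first-order autocorrelation, values near 0 or 4 suggest trouble:

```python
# Minimal sketch: AR(1) disturbances and the Durbin-Watson statistic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(9)
n, rho = 200, 0.8
e = rng.normal(0, 1, n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]          # autocorrelated errors

x = np.arange(n, dtype=float)
y = 1.0 + 0.05 * x + u

res = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", round(durbin_watson(res.resid), 3))   # well below 2 here
```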
Consequences of Autocorrelation
1. The OLS estimators are no longer BLUE, and the confidence intervals tend to be much wider, leading to the acceptance of the “zero null hypothesis” (i.e., that the true population coefficient is zero) more readily.
DUMMY VARIABLE REGRESSION
Introduction
 In regression analysis the dependent variable is frequently
influenced not only by ratio scale variables (e.g., income, output,
prices, costs, height, temperature) but also by variables that are
essentially qualitative, or nominal scale, in nature, such as
gender, race, color, religion, nationality, geographical region,
political upheavals, party affiliation, firm type and consumer type.

 Since such variables usually indicate the presence or absence of a “quality” or an attribute, such as male or female, black or white, Catholic or non-Catholic, Democrat or Republican, they are essentially nominal scale variables.
 One way we could “quantify” such attributes is by constructing artificial variables that
take on values of 1 or 0, 1 indicating the presence (or possession) of that attribute and 0
indicating the absence of that attribute.

 Variables that assume such 0 and 1 values are called dummy variables. Such variables
are thus essentially a device to classify data into mutually exclusive categories.

 A regression model may contain independent variables that are all exclusively dummy, or
qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models.

 Dummy variables are also called binary, dichotomous, instrumental or qualitative variables. They are variables that may take on only two values, such as 0 or 1.

 The general rule is that to re-specify a categorical variable with K categories, K – 1 dummy variables are needed. The reason for having K – 1, rather than K, dummy variables is that only K – 1 categories are independent.
 In a survey of consumer preferences for frozen foods, the respondents were classified as heavy users, medium users, light users and non-users, and they were originally assigned codes of 4, 3, 2 and 1, respectively.

 This coding was not meaningful for several statistical analyses. To conduct these analyses, product usage was represented by three dummy variables, D1, D2 and D3, as shown below:

Product usage category   Original code   D1   D2   D3
Non-users                1               1    0    0
Light users              2               0    1    0
Medium users             3               0    0    1
Heavy users              4               0    0    0
 Suppose that the researcher was interested in running a regression analysis of the effect of product use on attitude towards the brand. The dummy variables D1, D2 and D3 would be used as independent variables. Regression with dummy variables would be modeled as:
Yi = a + b1D1 + b2D2 + b3D3
 In this case, ‘heavy users’ have been selected as a reference category and have not been
directly included in the regression equation. Note that for heavy users, D1, D2 and D3 assume
a value of 0, and the regression equation becomes Yi = a

 For non-users, D1 = 1 and D2 = D3 = 0, and the regression equation becomes Yi = a + b1

 Thus, the coefficient b1 is the difference in predicted Yi for non-users, as compared with heavy users. The coefficients b2 and b3 have similar interpretations. Although heavy users were selected as the reference category, any of the other three categories could have been selected for this purpose.
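A minimal Python sketch of this dummy-variable regression (pandas and statsmodels assumed available; the attitude scores below are invented), with heavy users as the omitted benchmark category:

```python
# Minimal sketch: K - 1 dummies for a K-category usage variable; "heavy" is the benchmark.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "usage": ["non", "non", "light", "light", "medium", "medium", "heavy", "heavy"],
    "attitude": [3, 4, 5, 5, 6, 7, 8, 9],   # invented attitude scores (Y)
})

# Build 3 dummies from the 4 categories, dropping "heavy" so it becomes the reference.
dummies = pd.get_dummies(df["usage"]).drop(columns="heavy").astype(float)
X = sm.add_constant(dummies)
res = sm.OLS(df["attitude"].astype(float), X).fit()

# Intercept = mean attitude of heavy users; each dummy coefficient is the
# differential intercept relative to that benchmark category.
print(res.params.round(2))
```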
 Where you have a dummy variable for each category or group and also an intercept, you have a case of perfect collinearity, that is, an exact linear relationship among the variables. Such a situation is called the dummy variable trap, that is, the situation of perfect collinearity or perfect multicollinearity (if there is more than one exact relationship among the variables).

 The message here is: If a qualitative variable has m categories, introduce only (m − 1) dummy
variables.

 The category for which no dummy variable is assigned is known as the base, benchmark, control,
comparison, reference, or omitted category. And all comparisons are made in relation to the
benchmark category.

 The intercept value (β1) represents the mean value of the dependent variable for the benchmark category.

 The coefficients attached to the dummy variables are known as the differential intercept coefficients because they tell by how much the intercept of the category that receives the value of 1 differs from the intercept coefficient of the benchmark category.
