REGRESSION ANALYSIS
Measure of Association
Correlation analysis is most appropriate for interval or ratio variables. Regression can
accommodate either interval or less-than-interval independent variables, but the dependent
variable must be continuous.
The chi-square (χ²) test provides information about whether two or more less-than-interval
variables are interrelated.
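As a minimal sketch, such a test can be run with scipy.stats.chi2_contingency on a cross-tabulation; the variables and counts below are hypothetical, invented for illustration only.

```python
# Hypothetical example: chi-square test of independence between two
# nominal (less-than-interval) variables, e.g., gender and brand preference.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gender (male, female); columns: preferred brand (A, B, C)
observed = np.array([[30, 20, 10],
                     [20, 30, 20]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
# A small p-value suggests the two variables are interrelated.
```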
Covariance coefficients retain information about the absolute scale ranges, so the
strength of association for scales with different possible values cannot be compared
directly. This problem does not arise with correlation, which is standardized covariance
and hence independent of the units of measurement.
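A minimal numpy sketch of this point, on hypothetical data: rescaling a variable changes its covariance with y but leaves the correlation untouched.

```python
# Sketch: covariance changes with the measurement scale, correlation does not.
import numpy as np

rng = np.random.default_rng(0)
x_miles = rng.normal(50, 10, 100)            # distance in miles
y = 2.0 * x_miles + rng.normal(0, 5, 100)    # hypothetical response

x_km = x_miles * 1.609                       # same distances, different units

print(np.cov(x_miles, y)[0, 1], np.cov(x_km, y)[0, 1])            # covariances differ
print(np.corrcoef(x_miles, y)[0, 1], np.corrcoef(x_km, y)[0, 1])  # correlations match
```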
The correlation coefficient, r, ranges from –1.0 to +1.0.
If the value of r = +1.0, a perfect positive relationship exists. Perhaps the two
variables are one and the same!
No correlation is indicated if r = 0.
In a correlation matrix, the main diagonal consists of correlations of 1.00 (each variable
correlated with itself). Also REMEMBER that correlations should always be considered
together with their significance levels (p-values). If a correlation is not significant, it is of
little use.
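A minimal sketch of reading correlations together with their p-values, using scipy.stats.pearsonr on hypothetical variables:

```python
# Sketch: pairwise correlations with significance levels (p-values),
# using hypothetical variables v1, v2, v3.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
v1 = rng.normal(size=50)
v2 = 0.6 * v1 + rng.normal(scale=0.8, size=50)
v3 = rng.normal(size=50)          # unrelated to v1 and v2

for name, pair in [("v1-v2", (v1, v2)), ("v1-v3", (v1, v3)), ("v2-v3", (v2, v3))]:
    r, p = pearsonr(*pair)
    print(f"{name}: r = {r:+.3f}, p = {p:.4f}")
# The diagonal of a full correlation matrix would be 1.00 by definition.
```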
Coefficient of Determination (R2)
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SS_{regression}}{SS_y} = \frac{\text{Total variation} - \text{Error variation}}{\text{Total variation}} = \frac{SS_y - SS_{error}}{SS_y}$$
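A minimal sketch computing R² directly from the sums of squares on hypothetical data; np.polyfit stands in here for any simple-regression routine.

```python
# Sketch: R^2 from the sums of squares of a fitted simple regression.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=80)
y = 1.5 * x + rng.normal(scale=1.0, size=80)

b1, b0 = np.polyfit(x, y, 1)             # slope, intercept
y_hat = b0 + b1 * x

ss_y = np.sum((y - y.mean()) ** 2)       # total variation
ss_error = np.sum((y - y_hat) ** 2)      # error variation
r_squared = (ss_y - ss_error) / ss_y
print(f"R^2 = {r_squared:.3f}")
# In simple regression this equals the squared Pearson correlation.
```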
Regression Analysis
Regression analysis is a technique for measuring the linear
association between a dependent and an independent variable.
Standardization is the process by which the raw data are transformed into new variables that
have a mean of 0 and a variance of 1.
When the data are standardized, the intercept assumes a value of 0.
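A minimal sketch on hypothetical data: after z-standardizing X and Y, the fitted intercept is 0 (up to floating-point error).

```python
# Sketch: standardization forces the regression intercept to 0.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10, 3, 100)
y = 4.0 + 0.8 * x + rng.normal(size=100)

zx = (x - x.mean()) / x.std()            # mean 0, variance 1
zy = (y - y.mean()) / y.std()

slope, intercept = np.polyfit(zx, zy, 1)
print(f"intercept = {intercept:.6f}")    # ~0
print(f"slope (standardized beta) = {slope:.3f}")
```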
The statistical significance of the linear relationship between X and Y may be tested by examining
the hypotheses:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0$$

A t statistic with n − 2 degrees of freedom can be used, where

$$t = \frac{b}{SE_b}$$

and $SE_b$ denotes the standard deviation of b and is called the standard error.
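A minimal sketch of this t test on hypothetical data; statsmodels is assumed here, but t = b/SE_b can be computed from any OLS output.

```python
# Sketch: testing H0: beta1 = 0 with t = b / SE_b on n - 2 df.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=60)
y = 0.5 * x + rng.normal(size=60)

X = sm.add_constant(x)             # adds the intercept column
fit = sm.OLS(y, X).fit()

b = fit.params[1]                  # slope estimate
se_b = fit.bse[1]                  # standard error of the slope
print(f"t = b / SE_b = {b / se_b:.3f}  (statsmodels t = {fit.tvalues[1]:.3f})")
print(f"p-value = {fit.pvalues[1]:.4f}, df = {int(fit.df_resid)}")
```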
Assumptions of Regression
One potential problem with raw parameter estimates is that they reflect the range of the
measurement scale. If a simple regression involved distance measured in miles, a very
small parameter estimate may indicate a strong relationship. In contrast, if the very same
distance were measured in centimeters, a very large parameter estimate would be needed
to indicate a strong relationship.
Standardized β Coefficient
Standardized β coefficients are estimated coefficients that indicate the strength of a
relationship on a standardized scale, where higher absolute values indicate stronger
relationships (the range is from –1 to +1).
Raw regression weights (b), in contrast, have the advantage of retaining the scale metric.
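A minimal sketch of the scale problem and its standardized-beta remedy, with hypothetical distances measured in miles vs. kilometers:

```python
# Sketch: raw coefficients depend on the measurement scale,
# while the standardized beta does not.
import numpy as np

rng = np.random.default_rng(5)
miles = rng.normal(100, 20, 100)
y = 0.05 * miles + rng.normal(size=100)
km = miles * 1.609                          # same distances, different units

for name, x in [("miles", miles), ("km", km)]:
    b_raw = np.polyfit(x, y, 1)[0]
    beta = np.polyfit((x - x.mean()) / x.std(),
                      (y - y.mean()) / y.std(), 1)[0]
    print(f"{name}: raw b = {b_raw:.4f}, standardized beta = {beta:.4f}")
# The raw slopes differ by the unit conversion; the betas are identical.
```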
• A test for examining the significance of the linear relationship between X and Y (the significance of b) is the
test for the significance of the coefficient of determination. The hypotheses in this case are:

$$H_0: R^2_{pop} = 0 \qquad H_1: R^2_{pop} > 0$$

The appropriate test statistic is

$$F = \frac{SS_{regression}}{SS_{error}/(n-2)}$$

which has an F distribution with 1 and n − 2 degrees of freedom. The F test is a generalized form of the t test
(see Chapter 15). If a random variable is t distributed with n degrees of freedom, then t² is F distributed
with 1 and n degrees of freedom. Hence, testing

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$

is equivalent to testing

$$H_0: \rho = 0 \quad \text{vs.} \quad H_1: \rho \neq 0$$
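A minimal sketch on hypothetical data confirming that, in simple regression, F = t² and the two tests return the same p-value:

```python
# Sketch: equivalence of the t test of b1 and the F test of R^2
# in simple regression: F = t^2 with (1, n - 2) df.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=40)
y = 0.7 * x + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(x)).fit()
t = fit.tvalues[1]
print(f"t^2 = {t**2:.3f}, F = {fit.fvalue:.3f}")                 # equal
print(f"p(t) = {fit.pvalues[1]:.4f}, p(F) = {fit.f_pvalue:.4f}") # equal
```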
Multiple Regression
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
• Adjusted R-Square reduces R² by taking into account the sample size and the number of
independent variables in the regression model (it becomes smaller as we have fewer observations per
independent variable).
• Standard Error of the Estimate (SEE) is a measure of the accuracy of the regression predictions. It
estimates the variation of the dependent variable values around the regression line. It should get smaller
as we add more independent variables, if they predict well. When the SEE is very small, one would
therefore expect most of the observed values to cluster fairly close to the regression line, and vice
versa.
• Standard Error of the Regression Coefficient: the standard error of the beta coefficient gives us an
indication of how much the point estimate is likely to vary from the corresponding population parameter.
It measures the amount of sampling error.
Figure 1. Low S.E. estimate – Predicted Y values close to regression line
Figure 2. Large S.E. estimate – Predicted Y values scattered widely above and below
regression line
• Total Sum of Squares (SST) = the total amount of variation that exists to be
explained by the independent variables. SST is the sum of SSE and SSR.
• Sum of Squared Errors (SSE) = the variation in the dependent variable not
accounted for by the regression model, i.e., the residual variation. The objective is
to obtain the smallest possible sum of squared errors as a measure of prediction
accuracy. These quantities can be read off a fitted model, as in the sketch after this list.
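A minimal statsmodels sketch on hypothetical data, reading off adjusted R², the SEE, coefficient standard errors, and the SST = SSR + SSE decomposition. Note that statsmodels happens to name the explained and residual sums of squares `ess` and `ssr`, respectively.

```python
# Sketch: multiple regression diagnostics via statsmodels on hypothetical data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 2))                        # two independent variables
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()

print("adjusted R^2:", round(fit.rsquared_adj, 3))
print("SEE:", round(np.sqrt(fit.mse_resid), 3))    # standard error of the estimate
print("coef std. errors:", np.round(fit.bse, 3))
sst, ssr, sse = fit.centered_tss, fit.ess, fit.ssr  # ess = explained, ssr = residual
print(f"SST = {sst:.1f} = SSR ({ssr:.1f}) + SSE ({sse:.1f})")
```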
REGRESSION ASSUMPTIONS
Heteroskedasticity
The second assumption of regression is common variance, i.e., $\mathrm{Var}(u_i) = \sigma^2$ for all i. This assumption is
also known as homoskedasticity [equal (homo) spread (skedasticity)].
Violation of this second assumption of regression is called heteroskedasticity, i.e., the errors do not have a
constant/common variance.
In Fig 1, the conditional variance of Yi (which is equal to that of ui), conditional upon the given Xi, remains
the same regardless of the values taken by the variable X. In contrast, Fig 2 shows that the
conditional variance of Yi increases as X increases. Here, the variances of Yi are not the same; hence, there is
heteroskedasticity.
Consequences of Heteroskedasticity
4. Although the t ratio of one or more coefficients is statistically insignificant, R2, the
overall measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the
data.
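One common diagnostic, added here as an illustration rather than taken from the text above, is the Breusch-Pagan test. A minimal sketch on simulated data whose error variance grows with X:

```python
# Sketch: detecting heteroskedasticity with the Breusch-Pagan test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 200)
u = rng.normal(scale=0.5 * x)        # Var(u_i) increases with x: heteroskedastic
y = 2.0 + 1.5 * x + u

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM = {lm_stat:.2f}, p = {lm_p:.4f}")
# A small p-value rejects homoskedasticity, i.e., Var(u_i) is not constant.
```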
Autocorrelation
The term autocorrelation may be defined as “correlation between members of
series of observations ordered in time [as in time series data] or space [as in
cross-sectional data].”
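A standard first check for autocorrelation in time-ordered residuals is the Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation); a minimal sketch on simulated AR(1) errors:

```python
# Sketch: Durbin-Watson statistic on residuals from a model with
# autocorrelated (AR(1)) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):                 # build AR(1) errors
    u[t] = 0.8 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"Durbin-Watson = {durbin_watson(fit.resid):.2f}")   # well below 2
```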
Dummy Variables
Qualitative variables are often represented in regression by variables that assume only the values
0 and 1. Variables that assume such 0 and 1 values are called dummy variables. Such variables
are thus essentially a device to classify data into mutually exclusive categories.
A regression model may contain independent variables that are all exclusively dummy, or
qualitative, in nature. Such models are called Analysis of Variance (ANOVA) models.
The original category coding was not meaningful for several statistical analyses. To conduct these
analyses, product usage was represented by three dummy variables, D1, D2 and D3, as shown below:

Product usage category    D1    D2    D3
Nonusers                   1     0     0
Light users                0     1     0
Medium users               0     0     1
Heavy users                0     0     0
Suppose that the researcher was interested in running a regression
analysis of the effect of product use on attitude towards the brand. The
dummy variables D1, D2 and D3 would be used as independent
variables. The regression with dummy variables would be modeled as:

$$Y_i = a + b_1 D_1 + b_2 D_2 + b_3 D_3$$
In this case, ‘heavy users’ has been selected as the reference category and has not been
directly included in the regression equation. Note that for heavy users, D1, D2 and D3 assume
a value of 0, and the regression equation becomes $Y_i = a$.
Thus, the coefficient b1 is the difference in predicted $Y_i$ for nonusers as compared with heavy
users. The coefficients b2 and b3 have similar interpretations. Although heavy users was
selected as the reference category, any of the other three categories could have been selected for
this purpose.
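A minimal sketch of this dummy-variable regression on hypothetical data, with heavy users as the omitted reference category (pandas and statsmodels assumed):

```python
# Sketch: regression with dummy variables, heavy users as reference.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(10)
usage = pd.Series(rng.choice(
    ["nonuser", "light", "medium", "heavy"], size=200))
# Attitude toward the brand, rising with usage (hypothetical means)
means = {"nonuser": 2.0, "light": 3.0, "medium": 4.0, "heavy": 5.0}
y = usage.map(means) + rng.normal(scale=0.5, size=200)

# Three dummies D1-D3; "heavy" is omitted as the reference category
D = pd.get_dummies(usage)[["nonuser", "light", "medium"]].astype(float)
fit = sm.OLS(y, sm.add_constant(D)).fit()
print(fit.params.round(2))
# 'const' ~ mean attitude of heavy users; each dummy coefficient ~ that
# category's mean minus the heavy-user mean.
```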
Where you have a dummy variable for each category or group and also an intercept, you have a case
of perfect collinearity, that is, an exact linear relationship among the variables (or perfect
multicollinearity, if there is more than one exact relationship among the variables). Such a
situation is called the dummy variable trap.
The message here is: If a qualitative variable has m categories, introduce only (m − 1) dummy
variables.
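A minimal numpy sketch of the trap: with all m dummies plus an intercept, the design matrix loses rank; dropping one dummy restores full rank.

```python
# Sketch: the dummy variable trap as a rank deficiency of the design matrix.
import numpy as np

# Four categories, one dummy each, over 20 hypothetical observations
D = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]] * 5)
X_trap = np.column_stack([np.ones(len(D)), D])         # intercept + all 4 dummies
X_ok   = np.column_stack([np.ones(len(D)), D[:, :3]])  # drop one dummy

print(np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns")  # 4 of 5
print(np.linalg.matrix_rank(X_ok),   "of", X_ok.shape[1], "columns")    # 4 of 4
```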
The category for which no dummy variable is assigned is known as the base, benchmark, control,
comparison, reference, or omitted category. And all comparisons are made in relation to the
benchmark category.
The intercept value (β1) represents the mean value of the benchmark category.
The coefficients attached to the dummy variables are known as differential intercept
coefficients because they tell by how much the intercept of the category that receives the value
of 1 differs from the intercept coefficient of the benchmark category.