
Regression Analysis

Dr Sunil D Lakdawala
Sunil_lakdawala@hotmail.com
Regression Line

Equation of line: Y = a + b*X
a: Intercept, b: Slope
Draw the line passing through (1,5) and (2,7)
Find a and b
Predict the value of Y for X = 5
Is the relation direct?
Similarly, draw the line passing through (0,6) and (1,3)
Find a and b. Is the relation direct?
25-Aug-17 2
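The slide's exercise can be checked with a short sketch in pure Python; the points are the ones on the slide.

```python
# Slope and intercept of the line through two points, as on the slide.
def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope
    a = y1 - b * x1             # intercept
    return a, b

a, b = line_through((1, 5), (2, 7))    # a = 3, b = 2 (direct relation: b > 0)
y_at_5 = a + b * 5                     # predicted Y for X = 5 is 13
a2, b2 = line_through((0, 6), (1, 3))  # a = 6, b = -3 (inverse relation)
```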
Simple Regression

It is a bivariate linear regression; that is, it is the process of constructing a mathematical model or function that can be used to predict or determine one variable from another variable.
The variable to be predicted is called the dependent variable and is denoted by Y.
The predictor is called the independent variable or explanatory variable, and is denoted by X.
Regression Line (Cont)

Drawing the regression line for a scatter chart
See # of passengers vs cost
Find the error, absolute error and squared error
Can we take the error? The absolute error? The squared error? What are their characteristics?
Draw the line such that the squared error is least
The equation of the simple linear regression line is given by
Yi = a + b*Xi + εi
Minimize Σεi² by finding the best fit for a, b
Find the values of a and b that minimize the squared error
Problem

Let us consider the data displayed in the table.
The values in the 1st column denote the number of passengers for 12 five-hundred-mile commercial airline flights using Boeing 737s during the same season of the year.
We use these data to develop a regression model to predict cost from the number of passengers.
Regression Line (Cont)
b = (ΣXiYi − n*X̄*Ȳ) / (ΣXi² − n*X̄²)
a = Ȳ − b*X̄
Se (Standard Error) = sqrt(Σ(Yi − Ŷi)² / (n − 2))
Assumption: errors are normally distributed
What is the interpretation of the standard error? In comparison with the mean value of Y? In terms of percentage?
Look at the example of Cost vs Passengers
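A minimal sketch of these formulas in Python (the data below is illustrative, not the airline table from the deck):

```python
import math

# Least-squares slope b, intercept a, and standard error Se per the formulas above.
def fit(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) \
        / (sum(x * x for x in xs) - n * xbar ** 2)
    a = ybar - b * xbar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))   # n - 2 degrees of freedom
    return a, b, se

a, b, se = fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```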
Regression Line (Cont)
Interpretation of Standard Error (Cont)
What is the range of cost for 80 passengers with 95% confidence and 90% confidence?
For n > 30, use the Z distribution
For n < 30, normality cannot be assumed; need to use the t distribution
What will be the range for the above problem?
Is the range the same for all values of Y? Is that true?
Assumption: one is predicting within the range of the data
Correlation Analysis
Degree to which one variable is linearly related to another
Coefficient of Determination:
r² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²  (between 0 and 1)
   = 1 − Ratio
Ratio: variation between actual and predicted values w.r.t. variation of Yi from the mean (the unexplained part)
  = (variation of Y around the regression line) / (variation of Y around its own mean)
r² = 1 − ratio of the above two
r² = 0.78 means 78% of the variation of Y from Ȳ is explained by the regression; 22% is not explained
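The ratio above codes up directly (a pure-Python sketch):

```python
# r^2 = 1 - SSE/SST: the share of Y's variation explained by the regression.
def r_squared(ys, yhats):
    ybar = sum(ys) / len(ys)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))   # around the line
    sst = sum((y - ybar) ** 2 for y in ys)                 # around the mean
    return 1 - sse / sst

perfect = r_squared([1, 2, 3], [1, 2, 3])   # 1.0: predictions match exactly
useless = r_squared([1, 2, 3], [2, 2, 2])   # 0.0: no better than the mean
```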
Interpretation of r²

If the first term Σ(Yi − Ŷi)² is zero, the two variables are exactly linearly related, and the value will be 1
If the first term equals the second term Σ(Yi − Ȳ)², there is no relation, and the value is 0
Take the example of table 12-6 (DO), table 12-13, figure 12-13, figure 12-14 to find r² (HENKE)
Calculate the values using Excel
Coefficient of Correlation

r = sqrt(r²)
See fig 12.16
If r = 0.6, how good is the regression? How much variation in Y is explained by the regression?
Inferences about population parameters
Instead of a point value of b, we want to find the range of b with 90% confidence level
Find t for the given degrees of freedom and the given confidence level
Find sb (the standard error of b):
sb = Se / sqrt(ΣXi² − n*X̄²)
The range is b − t*sb to b + t*sb
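A sketch of the interval computation; the t value is taken as given (looked up from a t table for the chosen confidence level and degrees of freedom), and the inputs here are made-up numbers, not a worked example from the deck.

```python
import math

# Confidence interval for the slope b: b -/+ t * sb.
def slope_interval(b, se, xs, t):
    n = len(xs)
    xbar = sum(xs) / n
    sb = se / math.sqrt(sum(x * x for x in xs) - n * xbar ** 2)  # std. error of b
    return b - t * sb, b + t * sb

lo, hi = slope_interval(b=1.0, se=math.sqrt(5), xs=[1, 2, 3, 4], t=2.0)
```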
The Equation

The equation of the simple linear regression line is given by
Yi = β0 + β1*Xi + εi
Minimize Σεi² by finding the best fit for β0, β1
Table 5-2 Pulp Price Regression (Makridakis)
Table 5-4 PVC Regression (Makridakis)
Residual / Error
Outliers: observations with large residuals (Table 5-4)
Influential observations: observations that have a great influence on the fitted equation. Usually they are extreme observations. (See the King-Kong problem, Fig 5-8, Makridakis)
The Equation (Cont)

Causation vs Correlation: X: weekly # of deaths by drowning and Y: consumption of coke might be highly related, but there may not be a causal relationship. Both might be increasing in summer!
Lurking variable: an explanatory variable not included in the regression that is highly related to both X and Y (e.g. season in the above case)
Confounding variables: new car sales may depend upon both price as well as advertisement expenditure. The last two are called confounding variables
The Equation (cont)
The regression analysis is performed under the following assumptions.
Y = β0 + β1*X + ε
1. Residual:
(Y − Ŷ) should be near zero. (Y − Ŷ) is the residue, denoted by ε
2. Residual plot:
Plot X vs (Y − Ŷ). It should be random (i.e. normally distributed) and should not have any trend (unlike Y = X**2)
3. Standard Error Se = sqrt(SUM Squared Error / (N − K − 1)); K = 1 for simple regression; SUM Squared Error = Σ(Y − Ŷ)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residues should be within Se
95% of residues should be within 2*Se
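The 68%/95% rules of thumb in point 3 can be checked mechanically (a sketch; the residuals below are illustrative):

```python
# Fraction of residuals within 1*Se and 2*Se (should be near 0.68 and 0.95).
def residual_coverage(residuals, se):
    n = len(residuals)
    within_1 = sum(abs(r) <= se for r in residuals) / n
    within_2 = sum(abs(r) <= 2 * se for r in residuals) / n
    return within_1, within_2

w1, w2 = residual_coverage([-0.5, 0.2, 1.5, -0.1], se=1.0)  # (0.75, 1.0)
```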
The Equation (Cont)
4. Correlation (r) is a measure of the linear association between the variables. Even if the variables have a strong nonlinear relationship, r might be very small (see Fig 5-7, Makridakis)
5. For small n, r is notoriously unstable. For n = 30 or more, it starts becoming stable
6. r can change drastically due to extreme values (see the King-Kong problem, Fig 5-8, Makridakis, where just one extreme point changes r from 0.527 to 0.940). What should we do?
7. Coefficient of Determination: R**2 should be high, towards 1. Interpretation of R² and r
The Equation (Cont)
8. The P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (β0 = 0 / β1 = 0)
F = t**2 = MS (Regression) / MS (Residual) (t value for β1)
Significance F = p value for β1
9. Adjusted R**2 = 1 − (Sum Squared Error / (N − K − 1)) / (Σ(Y − Ymean)² / (N − 1))
10. It should make common sense, i.e. when X changes by 1, Y changes by the slope. A +ve or −ve change should make common sense
11. Only prediction within the range from which the model was made is valid
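Since 1 − R² = SSE/SST, the formula in point 9 is equivalent to the usual one-liner below; with the same R², more predictors (larger K) means a lower adjusted R² (the numbers are illustrative):

```python
# Adjusted R^2 penalizes extra predictors: same fit, larger K, lower value.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

a1 = adjusted_r2(0.90, n=12, k=1)   # one predictor
a3 = adjusted_r2(0.90, n=12, k=3)   # three predictors, penalized more
```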
The Equation (Cont)
12. Please see equation 5.19 for the error interval on the predicted value
13. Please see the equations for β0 and β1 and their error intervals on page 216
14. Residues vs the explanatory variable should not have any pattern (no trend, no seasonality, etc.)
15. Residues should have zero mean and should be normally distributed
Data and Analysis

Summary Output

Residuals

It is the difference between the actual Y value and the Y value predicted by the regression model for each value of the dependent variable.
Residuals

The total of the squared residuals is called the sum of squares of error (SSE).
The standard error of the estimate is the standard deviation of the errors of the regression model.
Coefficient of Determination

A widely used measure of fit for regression models is the coefficient of determination.
The coefficient of determination is the proportion of variability of the dependent variable (Y) accounted for or explained by the independent variable (X).
It is denoted by r².
It lies between 0 and 1.
r² in Airlines Cost

r² = .899 [pg6,12,13]
This means that about 89.9% of the variability of the cost of flying a Boeing 737 airplane on a commercial flight is accounted for or predicted by the number of passengers.
This also means that about 10.1% of the variation in airline flight cost, Y, is unaccounted for by X or unexplained by the regression model.
Correlation

It is a measure of association. It measures the strength of relatedness of two variables.
For example, we may be interested in determining the correlation between the prices of two stocks in the same industry.
How strong are these correlations?
The Pearson product-moment correlation coefficient is given by
r = Σ(Xi − X̄)(Yi − Ȳ) / sqrt(Σ(Xi − X̄)² * Σ(Yi − Ȳ)²)
Correlation
1. The measure is applicable only if both variables being analyzed have at least an interval level of data.
2. r is a measure of the linear correlation of two variables.
3. r = +1 denotes a perfect positive relationship between two sets of variables.
4. r = −1 denotes a perfect negative correlation, which indicates an inverse relationship between two variables.
5. r = 0 means that there is no linear relationship between the two variables.
6. The coefficient of determination is the square of the correlation coefficient: r²
Factors to be taken care of

The plot of residuals vs X should be healthy (carry out the example Y = SQR(X))
Do not try to predict Y outside the range used for building the model (try predicting for Y = SQR(X))
Multiple Regression Model
The general equation which describes the multiple regression model is given by
Yi = β0 + β1*X1 + β2*X2 + … + βk*Xk + ε
Minimize Σε² by finding the best βi
Assumptions made in the model are:
1. Residual:
(Y − Ŷ) should be near zero. (Y − Ŷ) is the residue, denoted by ε
Plot Xi vs (Y − Ŷ) for each Xi. It should be random and should not have any trend (unlike Y = X**2)
2. Standard Error Se = sqrt(SUM Squared Error / (N − K − 1)); K = # of independent variables; SUM Squared Error = Σ(Y − Ŷ)**2
Se should be acceptable (Se / Ymean gives a good idea of the error)
68% of residues should be within Se
95% of residues should be within 2*Se
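A minimal fit of this model via least squares, assuming numpy is available; the data is made up so that Y lies exactly on a plane and the coefficients are recovered exactly (not one of the deck's tables):

```python
import numpy as np

# OLS for Y = b0 + b1*X1 + b2*X2 using a design matrix with an intercept column.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y = 1.0 + 2.0 * X1 + 3.0 * X2                     # exact plane: b = (1, 2, 3)
A = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix
coef, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
residuals = Y - A @ coef                          # (Y - Yhat), all near zero here
```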
Multiple Regression Model (Cont)

3. Coefficient of Determination: R**2 should be high, towards 1
4. The P value should be smaller than 0.05 (i.e. 95% confidence) for rejecting the null hypothesis (βi = 0)
F = MS (Regression) / MS (Residual) (overall test)
Significance F = p value overall. For rejecting the null hypothesis (β1 = 0 and β2 = 0 …), this should be small
5. It should make common sense, i.e. when X changes by 1, Y changes by the slope. A +ve or −ve change should make common sense
6. Adjusted R**2 should not be very different from R**2. By adding more variables, one can always make R**2 large, but Adjusted R**2 might be much smaller than R**2
Problem
See Fig 6-1 Bankdata Regression
A real estate study was conducted in a small city to determine what variables, if any, are related to the market price of a home.
Several variables were explored, including the number of bedrooms, the number of bathrooms, the age of the house, the number of square feet of living space, the total number of square feet of space, and how many garages the house had.
Suppose that the business analyst wants to develop a regression model to predict the market price of a home by two variables: "total number of square feet in the house" and the age of the house.
The data are given in the table.
The Fitted Model
Y = 57.351 + 0.0177*X1 − 0.6663*X2
Interpretation:
The Y-intercept is equal to 57.351. In this example, the Y-intercept does not have any practical significance.
The coefficient of X1 (total number of square feet in the house) is 0.0177. This means that a 1-unit increase in square footage would result in a predicted increase of (0.0177)($1000) = $17.70 in the price of the home if the age were held constant.
The coefficient of X2 (age) is −0.6663. The negative sign on the coefficient denotes an inverse relationship between the age of a house and the price of the house: the older the house, the lower the price. In this case, if the total number of square feet in the house is kept constant, a 1-unit increase in the age of the house (1 year) will result in (−0.6663)($1000) = −$666.30, a predicted drop in the price.
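The fitted equation from the slide as a sketch; the 2000 sq ft / 10 year inputs are illustrative, and price is in $1000s as in the slide:

```python
# The fitted model Y = 57.351 + 0.0177*X1 - 0.6663*X2 (price in $1000s).
def predict_price(sqft, age):
    return 57.351 + 0.0177 * sqft - 0.6663 * age

p = predict_price(2000, 10)   # 57.351 + 35.40 - 6.663 = 86.088, about $86,088
```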
Testing the Model

r² = 0.715
Testing the overall model
Significance tests of the regression coefficients
Analysis of Residuals

Multicollinearity

Multicollinearity refers to two or more independent variables of a multiple regression model being highly correlated. This causes problems in the interpretation of results.
In particular, these problems are:
It is difficult, if not impossible, to interpret the estimates of the regression coefficients.
Inordinately small t values for the regression coefficients may result.
The standard deviations of the regression coefficients are overestimated.
The algebraic sign of an estimated regression coefficient may be the opposite of what would be expected for a particular predictor variable. [pg24]
Search Procedures
All possible regressions:
Take all possible combinations of the K variables (2**K − 1 models). Choose the best model.
Forward selection:
Start with one variable. Try out all variables one at a time. Choose the best one.
Then add a 2nd variable, and so on.
Choose the best model.
Backward elimination:
Start with all variables.
Eliminate the one with the smallest t.
Keep on repeating.
Stepwise regression:
Same as forward selection, but at every step also check that each variable already included remains significant (acceptable p value)
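A skeleton of forward selection; the score function (assumed supplied by the caller, e.g. returning R² for a candidate predictor set) and the min_gain stopping rule are illustrative assumptions, not part of the deck:

```python
# Greedy forward selection: repeatedly add the predictor that most improves
# the score; stop when no candidate improves it enough.
def forward_select(predictors, score, min_gain=0.01):
    chosen, best = [], 0.0
    remaining = list(predictors)
    while remaining:
        gains = {p: score(chosen + [p]) for p in remaining}
        p = max(gains, key=gains.get)
        if gains[p] - best < min_gain:
            break                      # no worthwhile improvement left
        chosen.append(p)
        best = gains[p]
        remaining.remove(p)
    return chosen
```

Backward elimination and stepwise regression are variations on the same loop: start full and drop the weakest variable, or re-check significance of already-chosen variables after each addition.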
Factors to be taken care of
The value of R² could be inflated. Consider adjusted R².
A better model does not imply cause and effect between the independent variables and the dependent variable (some other factors might be causing both)
The value of a regression coefficient may not directly tell about its importance, because of:
Different units
Multicollinearity
Multicollinearity can create problems. To address the same:
Use search procedures
Use r between pairs of variables. For a large value, do not take both
Non Linear Models (5/4 - Makridakis)
Nonlinearity in parameters: more complex (one may be able to use a transformation to convert to linear form in certain cases)
Nonlinearity in variables
Local regression (see 5/4/3 of Makridakis)
Non-Linear Model
Y = β0 + β1*X1 + β2*X2**2 + ε
Choose Y1 = X1; Y2 = X2**2

Y = β0 + β1*X1 + β2*X1*X2 + ε
Choose Y1 = X1; Y2 = X1*X2

Y = β0*β1**X
Log(Y) = Log(β0) + X*Log(β1); now it is in linear form
Similarly, Y = β0*X**β1 and Y = 1 / (β0 + β1*X1 + β2*X2) can be converted into linear regressions
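The Y = β0*β1**X transformation, sketched end to end: regress log(Y) on X with the simple-regression formulas, then exponentiate back. The data is constructed as exactly 2 * 3**X so the recovery is exact.

```python
import math

# Linearize Y = b0 * b1**X by regressing log(Y) on X, then back-transform.
xs = [0, 1, 2, 3]
ys = [2.0, 6.0, 18.0, 54.0]          # exactly 2 * 3**X
logy = [math.log(y) for y in ys]
n = len(xs)
xbar, lbar = sum(xs) / n, sum(logy) / n
slope = (sum(x * l for x, l in zip(xs, logy)) - n * xbar * lbar) \
        / (sum(x * x for x in xs) - n * xbar ** 2)
intercept = lbar - slope * xbar
b1, b0 = math.exp(slope), math.exp(intercept)   # recover the original parameters
```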
Indicator (Dummy) Variable

Regression requires numeric values
For Gender, define Gender = 1 for Male and Gender = 0 for Female
For Region (N/S/W/E), do not map to 0, 0.33, 0.66 and 1.0 (Why? Unordered)
Define three variables, X1, X2 and X3:
X1 = 1 for North Region, 0 otherwise
X2 = 1 for South Region, 0 otherwise
X3 = 1 for West Region, 0 otherwise
X1 = X2 = X3 = 0 represents East Region
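The Region coding above as a sketch, with East as the baseline (all zeros):

```python
# Dummy coding for the unordered Region factor (N/S/W/E).
def region_dummies(region):
    return {"X1": int(region == "N"),   # North
            "X2": int(region == "S"),   # South
            "X3": int(region == "W")}   # West; East maps to all zeros

east = region_dummies("E")   # {'X1': 0, 'X2': 0, 'X3': 0}
```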
Others (Pg 270 Makridakis)
Trading day variation:
Introduce seven variables, T1: # of Mondays in the month, T2: # of Tuesdays in the month, …
Holiday effect:
V = 1 if Diwali falls in this month (or partly in this month)
Interventions (Pg 271 Makridakis)
Seat belt legislation was introduced
Due to that, car accidents went down
Introduce a dummy variable: I = 0 (before seat belt legislation) and I = 1 (after seat belt legislation)
More complex models can be introduced if the effect is spread over some time
See figure 8-15 for the intervention variable
Effect of Advertising Expenditure on Sale (Pg 271 Makridakis)

Monthly sales is the output variable and monthly advertisement expense is one of the input variables
The effect of advertisement expense lasts (say) up to 3 months
One can model this as follows:
Yt = b0 + b1*X1,t + b2*X1,t-1 + b3*X1,t-2 + …
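The distributed-lag equation above can be sketched as follows; the coefficients b0 and b are illustrative placeholders, not fitted values:

```python
# Distributed lag: sales respond to advertising in the current and two
# previous months (b holds the lag-0, lag-1 and lag-2 coefficients).
def predict_sales(ad, t, b0=10.0, b=(0.5, 0.3, 0.1)):
    return b0 + sum(bi * ad[t - i] for i, bi in enumerate(b))

ad = [4.0, 6.0, 8.0, 10.0]        # monthly advertising expense
s = predict_sales(ad, t=3)        # 10 + 0.5*10 + 0.3*8 + 0.1*6 = 18.0
```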
Miscellaneous
Variance-Covariance Matrix
Vector X = (X1, X2, X3, …)
Let μ(i) be the arithmetic average of X(i)
σ(i,j) = Σk (X(i,k) − μ(i))*(X(j,k) − μ(j)) / N
σ(i,i) (the diagonal) is the variance of X(i)
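The formula above, coded directly (a sketch; note it divides by N, the population version, and the two-variable data is illustrative):

```python
# Population variance-covariance matrix per the slide's formula.
def cov_matrix(rows):
    # rows[k][i] is observation k of variable i
    n, d = len(rows), len(rows[0])
    mu = [sum(r[i] for r in rows) / n for i in range(d)]
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n
             for j in range(d)]
            for i in range(d)]

m = cov_matrix([[1, 2], [3, 4]])   # diagonal entries are the variances
```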