
Correlation & Regression Analysis

Sahadeb Sarkar
Operations Management Group, IIM Calcutta
Readings

Regression-Based Business Forecasting:
Covariance & Correlation Coefficient (Sec 3.5, 5.2), Simple Linear Regression (Sec 13.1-13.6), Multiple Linear Regression (Sec 14.1, 14.2, 14.6 (Dummy Var Reg), 15.1 (Polynomial Reg), 15.2)

Textbook: “Statistics for Managers using Microsoft Excel”, Levine, Stephan & Szabat, 8th ed.
Regression: Modeling “Effect” Y using “Cause(s)” or Proxy for Cause(s) X

Effect (Y) and “Cause” (X):
• (Market Share, Size of Quality Sales Force)
• (Sales, Adv Exp)
• (Amount of sales, No. of customers visiting store)
• (Closing Sensex next day, Closing DJIA (Dow Jones Industrial Average) Index)
• (Performance in Competitive Exam, I.Q.)
• (Gold Price, Rate of Inflation)
• (Percentage of Defectives in the output, Speed of Conveyor Belt)

Note: Y is known as the “Effect/Response/Target/Dependent” variable;
X is known as the “Predictor/Explanatory/Independent/‘Cause’/Proxy for ‘Cause’” variable.
“Multiple Causes-and-Effect” Model

• Sales of a company; Exp on Adv (TV, Radio, Print Media), Personnel Cost
• Sales of a store; Promotional Expenses, Discounts
• Performance evaluation; Aptitude, Job experience, MBA or not
• Car mileage; Horse power, Weight, Weather (temperature), Driving behavior
Simple Linear Regression Model

$$Y_i = a + bX_i + e_i$$

[Figure: For a given X_i, the observed value Y_i differs from the predicted value on the fitted line (intercept = a, slope = b) by the random error e_i.]
Simple Linear Regression: Assumptions

Assumptions of the Simple Linear Regression Model:
• The relationship between X and Y is a straight-line relationship: E[Y] = β0 + β1X.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term e_i.
• The errors e_i are normally distributed with mean 0 and variance σ². The errors are at least uncorrelated (not related) in successive observations. That is: e_i ~ N(0, σ²).

[Figure: Identical normal distributions of errors, all centered on the regression line.]

Suppose Y = Sales and X = AE (Adv Exp); we assume that for a given AE value the conditional distribution of possible sales values is normal, the expected (average) Sales changes as a linear function of AE, and the variation in Sales remains unchanged.
Trent

Trent is a retail operations company that owns and manages a number of retail chains in India. Established in 1998, Trent runs the lifestyle chain Westside, one of India’s largest and fastest growing chains of lifestyle retail stores; Star Bazaar, a hypermarket chain; Landmark, a books and music chain; and Fashion Yatra, a complete family fashion store.
Example: Data on Quarterly Advertising Expense (X) and Quarterly Net Sales (Y) for Trent from Dec 2003 to Mar 2008.

Note: The last few quarters form a Hold-Out or Validation Sample, set aside to check prediction ability.
Different Regression Models

Simple Linear Regression:
(1) Sales = a + b*AE, AE = Adv Exp
(2) Sales = a + b*PC, PC = Personnel Cost

Multiple Linear Regression:
(3) Sales = a + b1*AE + b2*PC
[Note 1: Adding (AE+PC) as a new predictor in the above model is not useful.]

Polynomial Regression:
(4a) Sales = a + b1*PC + b2*(PC)²
(4b) Sales = a + b1*AE + b2*(AE)²

Dummy Variable Regression:
(5) Sales = a*1 + b*Time + c1*Q1 + c2*Q2 + c3*Q3, where Q1, Q2, Q3 are indicator variables for quarters 1, 2, 3.
[Note 2: a = a*1; Q1+Q2+Q3+Q4 = 1 ⇒ Q4 = 1 − (Q1+Q2+Q3); adding Q4 in the above model as a predictor is not useful when the constant term a and Q1, Q2, Q3 are present.]
Summary

R² = proportion of variation in the response variable explained by the explanatory variables; Adj R² = Adjusted R² is an improved version of R²; DW = Durbin-Watson statistic to quantify the ‘degree of linear relationship’ among regression errors over time; MAPE = Mean (average) Absolute Percentage Error, where

APE (absolute percentage error) = |actual value − predicted value| / (actual value) × 100
Simple Linear Reg: Sales = a + b*AE

Data: (x1,y1), …, (xn,yn)

1. Is there a linear relationship between the variables Net Sales and Adv Exp? [construct a scatter plot of Y vs X]
2. If a linear relationship exists, how strong is it? [calculate the correlation coefficient between Y and X]
3. If the linear relationship is strong, give an estimated linear relationship of Net Sales on Adv Exp based on the data. [estimate the a, b coefficients through the ‘least squares’ of errors method]
4. How good is the fit? [calculate R² = proportion of variation in Y values explained by the variation in X values through the linear regression model]
[Figure: Line plot of quarterly Net Sales (left axis, 30–130) and Adv Exp (right axis, 2–12) for Trent over the sample quarters, Dec-03 onward.]
Trent data: Correlation = 0.78

[Scatter plot: Net Sales (20–140) vs Adv Exp (2–12).]
Correlation = 0.97

[Scatter plot: Net Sales (20–140) vs Personnel Cost (2–8).]
Measures of linear relationship: (Sample) Covariance

Definition: $\text{Covariance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

Shortcut Formula: $\frac{1}{n-1}\left[\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}\right]$

Note: When Y = X, Covariance(X,Y) is the Variance of X.

Shortcut Formula for sample variance:

$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}\right]$$

Note 1: $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}$

Note 2: $\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x}) = \sum_{i=1}^{n}x_i x_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}x_i)}{n} = \sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}$
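As a quick numerical check, here is a minimal NumPy sketch (mine, not from the deck) that computes the covariance both from the definition and from the shortcut formula, using the first five Trent quarters (Adv Exp, Net Sales) from the data table later in these slides:

```python
import numpy as np

# First five Trent quarters from the data table: Adv Exp (x), Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13])
n = len(x)

# Definition: (1/(n-1)) * sum of (xi - xbar)(yi - ybar)
cov_def = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Shortcut: (1/(n-1)) * [sum(xi*yi) - (sum xi)(sum yi)/n]
cov_short = (np.sum(x * y) - x.sum() * y.sum() / n) / (n - 1)

print(cov_def, cov_short)           # identical up to floating-point rounding
print(np.cov(x, y, ddof=1)[0, 1])   # NumPy's built-in covariance agrees
```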
Measures of linear relationship: (Sample) Covariance

$$\text{Covariance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

• Covariance is positive if above-average (below-average) X values usually occur together with above-average (below-average) Y values.
• Covariance depends on the units of X and Y. [The relation between GDP and Inflation and that between Price and Sales cannot be compared with covariance.]
• Its value ranges between −∞ and ∞; there is no finite benchmark as to how large is “large”.
Measure of linear relationship: (Sample) Correlation

$$r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{\text{Covariance}(x,y)}{\text{SD}(x)\,\text{SD}(y)}$$

(Recall Notes 1 and 2 from the covariance slide.)
Measure of linear relationship: (Sample) Correlation

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Note 1: $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}$

Note 2: $\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}$

Hence, Short-cut Formula for Correlation:

$$r = \frac{\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}}{\sqrt{\left[\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}\right]\left[\sum_{i=1}^{n}y_i^2 - \frac{(\sum_{i=1}^{n}y_i)^2}{n}\right]}}$$
Measures of linear relationship: (Sample) Correlation

• Correlation is unit-free. (The relation between GDP and Inflation and that between Price and Sales can be compared with correlation.)
• Its value lies between −1 and 1; there is a finite benchmark as to how large is “large”.
Calculation of r

Note: For calculation of the variance, covariance, correlation and almost all of the regression-related statistics we do not require the raw data {(xi, yi)}; we just need to know:

$$n,\quad \sum_{i=1}^{n}x_i,\quad \sum_{i=1}^{n}x_i^2,\quad \sum_{i=1}^{n}y_i,\quad \sum_{i=1}^{n}y_i^2,\quad \sum_{i=1}^{n}x_i y_i$$
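To illustrate this point (a sketch of mine, not the deck's), r can be computed from those six sums alone; on the 14 Trent training quarters it reproduces the r ≈ 0.78 reported earlier:

```python
import numpy as np

def corr_from_sums(n, sx, sxx, sy, syy, sxy):
    # Shortcut formula for r, built from the six sufficient statistics only
    num = sxy - sx * sy / n
    den = np.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return num / den

# 14 Trent training quarters: Adv Exp (x) and Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
              5.08, 11.18, 10.18, 11.03, 5.67])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

r = corr_from_sums(len(x), x.sum(), (x**2).sum(),
                   y.sum(), (y**2).sum(), (x * y).sum())
print(round(r, 2), round(np.corrcoef(x, y)[0, 1], 2))  # both ~0.78
```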
Correlation Analysis for Trent Data

[Table: covariances and correlations of Net Sales with Adv Exp and with Personnel Cost.]

Note: Going by the covariance as a measure of linear relationship can be misleading, as seen from this example. Here, covariance suggests Adv Exp is a better linear predictor of sales than Personnel Cost; but common sense and the scatterplots lead us to believe otherwise.
Scatter plots showing positive and negative correlation

[Figure: example scatter plots.]

(Nearly) Zero Correlation

[Figure: example scatter plot.]
Interpretation of Correlation values

• Zero correlation does not mean no relationship. [There may be a nonlinear relationship; correlation measures linear relationship.]
• Similarly, high correlation does not mean a causal relationship between the variables. [There could be hidden variable(s) influencing both the Y and X variables.]

Example 1: X = No. of firemen fighting a fire, Y = damage from fire (correlation usually positive; Hidden Factor: Severity of fire)
Example 2: X = Computer Sales, Y = Demand for Dental Care (correlation usually positive; Hidden Factor: GDP growth)
Example 1: Quarterly Advertising Expense (X) and Quarterly Net Sales (Y) for Trent from Dec-03 to Mar-07. Predict sales for the Jun-07 quarter if (approx) 9.58 crores are earmarked for Adv Exp for Jun-07.
Simple linear regression

• Model: Yi = (a + bXi) + ei, i = 1, 2, …, n
• “error” ei = part of Yi unexplained
  – the true relation between Y and X may be nonlinear
  – Y depends not just on X but on other unincluded variables
Example: Y = sales, X = adv. exp. (other explanatory variables: personnel cost, price, GDP growth)

Least Squares Method: Minimize the sum of ei² over all time periods, i.e.,
Minimize SSE = Σi (yi − a − bxi)², w.r.t. a, b
Simple Linear Regression Model: Yi = (a*1 + b*Xi) + ei.

There are two explanatory variables: 1 (constant) and X (nonconstant).
• Multiply yi = (a*1 + b*xi) by 1 throughout, then sum over i to get equation (1) below;
• Multiply yi = (a*1 + b*xi) by xi throughout, then sum over i to get equation (2) below.

$$na + b\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}y_i \quad (1)$$
$$a\sum_{i=1}^{n}x_i + b\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_i y_i \quad (2)$$

i.e., in matrix notation we have

$$\begin{pmatrix} n & \sum_{i=1}^{n}x_i \\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2 \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_i y_i \end{pmatrix}$$
Estimates of b, a

$$\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = r_{y,x}\left(\frac{s_y}{s_x}\right),$$
where $s_y$ and $s_x$ are the sample standard deviations of the y and x variables.

$$\hat{a} = \bar{y} - \hat{b}\bar{x}$$
Predicted value of y $= \hat{a} + \hat{b}x = \bar{y} + \hat{b}(x-\bar{x})$

Note: We write the estimates as $\hat{a}$ (a-hat), $\hat{b}$ (b-hat), and $\hat{y}$ (y-hat); for X = xi we write the predicted value of y as $\hat{y}_i$.
Short-Cut Formula: Estimates of b, a

$$\hat{b} = \frac{\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}}{\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}} \quad \text{(by Notes 1 and 2)}$$

$$\hat{a} = \bar{y} - \hat{b}\bar{x}$$
Predicted value of y $= \hat{a} + \hat{b}x = \bar{y} + \hat{b}(x-\bar{x})$

Note: We write the estimates as $\hat{a}$ (a-hat), $\hat{b}$ (b-hat), and $\hat{y}$ (y-hat); for X = xi we write the predicted value of y as $\hat{y}_i$.
Estimated Reg line (Trent data): Net Sales = 27.82 + 7.59 (Adv Exp)
(i) Interpretation of $\hat{b}$ (b-hat), $\hat{a}$ (a-hat)??
(ii) For next quarter Jun-07, predicted net sales = 27.82 + 7.59*(9.58) = 100.56 crore (actual sales = 121.6)
How Good is the Regression Fit?

• Total variation in y-values is divided in two parts:

$$SST = SSR + SSE$$

Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

$$SST = \sum(Y_i-\bar{Y})^2 \qquad SSR = \sum(\hat{Y}_i-\bar{Y})^2 \qquad SSE = \sum(Y_i-\hat{Y}_i)^2$$

where:
$\bar{Y}$ = average value of the dependent variable
$Y_i$ = observed values of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given Xi value
Measures of Variation

[Figure: For each observation, the deviation $(Y_i-\bar{Y})$ splits into $(\hat{Y}_i-\bar{Y})$ and $(Y_i-\hat{Y}_i)$; summing squares gives SST = $\sum(Y_i-\bar{Y})^2$, SSR = $\sum(\hat{Y}_i-\bar{Y})^2$, SSE = $\sum(Y_i-\hat{Y}_i)^2$.]
R² = proportion of variation in Y explained by variation in explanatory variable(s) through the regression relation

Definition: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

Definition: $\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$
where k = number of non-constant predictors and ‘1’ refers to the presence of a constant predictor.

Note: R² involves the ratio of SSE (total amount of variation in errors) to SST (total amount of variation in response Y values);
Adjusted R² involves the ratio of SSE/(n−k−1) (i.e., ‘average’ amount of variation in errors) to SST/(n−1) (i.e., ‘average’ amount of variation in response Y values).

Verify: $\text{Adjusted } R^2 = 1 - (1-R^2)\frac{(n-1)}{(n-k-1)}$
R² (‘Coefficient of Determination’) & Adjusted R²

R² (Coefficient of Determination): proportion of variation in Y explained by variation in explanatory variable(s) through the regression relation

Adjusted R²: modified version of R² which penalizes a model for including redundant explanatory variables and takes into account sample size

(i) 0 ≤ R² ≤ 1; (ii) Adjusted R² ≤ R²; (iii) Adjusted R² may be negative.
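The following sketch (mine, not the deck's) computes both measures for the SLR of Net Sales on Adv Exp; with k = 1 non-constant predictor it should land near the R² = 0.61 and Adj R² = 0.58 reported later for this fit:

```python
import numpy as np

def r2_and_adjusted(y, yhat, k):
    # R^2 = 1 - SSE/SST; Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1))
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst, 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# 14 Trent training quarters: Adv Exp (x) and Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
              5.08, 11.18, 10.18, 11.03, 5.67])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Least-squares slope and intercept (see the estimates slide)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(r2_and_adjusted(y, a + b * x, k=1))  # ~ (0.61, 0.58)
```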
Are Regression Errors “independent”?
(Trent Data: SLR of Net Sales on Adv Exp)

[Figure: Residuals (Res) vs predicted values (Pred) for the SLR on Adv Exp; DW = 1.33.]
Durbin-Watson (DW) Statistic

• DW = sum of squares of (residual − previous residual), divided by sum of squares of residuals.
• DW ≈ 2 * (1 − r), where r = correlation between current residual and previous residual
• DW ≈ 2 indicates errors are serially uncorrelated
• 1.5 ≤ DW ≤ 2.5 may be acceptable (range of DW is 0 to 4)
Durbin-Watson (DW) Statistic

$$DW = \frac{\sum_{i=2}^{n}(e_i-e_{i-1})^2}{\sum_{i=1}^{n}e_i^2} = \frac{\sum_{i=2}^{n}e_i^2}{\sum_{i=1}^{n}e_i^2} + \frac{\sum_{i=2}^{n}e_{i-1}^2}{\sum_{i=1}^{n}e_i^2} - \frac{\sum_{i=2}^{n}2e_i e_{i-1}}{\sum_{i=1}^{n}e_i^2}$$

$$\approx 1 + 1 - 2\,\frac{\sum_{i=2}^{n}(e_i-\bar{e})(e_{i-1}-\bar{e})}{\sqrt{\sum_{i=1}^{n}(e_i-\bar{e})^2}\,\sqrt{\sum_{i=1}^{n}(e_{i-1}-\bar{e})^2}} \approx 2\,(1 - \text{Corr}(e_i, e_{i-1}))$$

• Hence, DW = sum of squares of successive differences in residuals divided by sum of squares of residuals ≈ 2(1 − correlation between successive regression errors ei and ei−1)
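A minimal numerical check (my own sketch): compute DW directly and compare it with the 2(1 − r) approximation on simulated serially uncorrelated residuals; both come out near 2:

```python
import numpy as np

def durbin_watson(e):
    # Sum of squared successive differences divided by sum of squared residuals
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
e = rng.normal(size=200)                   # illustrative uncorrelated residuals
r = np.corrcoef(e[1:], e[:-1])[0, 1]       # lag-1 sample correlation
print(durbin_watson(e), 2 * (1 - r))       # both near 2, as the slide claims
```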
Random Pattern, as shown in graph below, indicates Regression Errors “Independent”

Horizontal rectangular pattern ⇒ ‘Random’ pattern

[Figure: residuals scattered in a horizontal band with no visible pattern.]
Estimating Error Variance, σ², in Simple Linear Regression: Yi = a + b*Xi + ei

Building blocks of SSE are the residuals $e_i = (Y_i - \hat{Y}_i)$, which satisfy two restrictions; thus only (n−2) of the residuals can take values freely but the remaining two cannot. Hence the ‘degrees of freedom’ is (n−2) and we divide by (n−2) to calculate the ‘average’ of the squared errors in the definition MSE = SSE/(n−2), which estimates σ².
How good is SLR of Net Sales on Adv Exp?

• Trent Data: Net Sales = 27.82 + 7.59 (Adv Exp); (R² = 0.61, Adj R² = 0.58, DW = 1.33)

MAPE = Mean Absolute Percentage Error
= (1/4)(17.29 + 8.38 + 16.90 + 52.75) = 23.83
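For concreteness, a tiny sketch (mine) of the APE/MAPE calculation; applied to the Jun-07 hold-out point it reproduces the first APE above, |121.6 − 100.56|/121.6 × 100 ≈ 17.3:

```python
import numpy as np

def mape(actual, predicted):
    # Mean of APE = |actual - predicted| / actual * 100 over the hold-out points
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual * 100)

print(round(mape([121.6], [100.56]), 2))  # ~17.3, the Jun-07 APE on this slide
```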
Example 2 (Trent data): Personnel Cost in place of Adv Exp. Predict sales for Jun-07 if (approx) 9.19 crores are earmarked for Personnel Cost.

Quarter   Advertising expenses   Personnel cost   Net sales
Dec-03    4.3                    2.74             42.02
Mar-04    2.7                    3.27             40.35
Jun-04    4.59                   3.3              46.89
Sep-04    4.54                   3.41             54.1
Dec-04    8.87                   3.85             70.13
Mar-05    3.17                   4.1              63.37
Jun-05    7.44                   4.14             74.01
Sep-05    8.33                   4.89             86.63
Dec-05    9                      5.22             95.17
Mar-06    5.08                   6.02             90.63
Jun-06    11.18                  6.64             105.09
Sep-06    10.18                  7.5              120.29
Dec-06    11.03                  7.11             121.83
Mar-07    5.67                   7.19             108.58

Jun-07    9.58                   9.19             121.6    (hold-out quarter)
Estimated Reg line (Trent data): Net Sales = −2.007 + 16.53 (Person. Cost).
June 2007 quarter: predicted net sales = −2.007 + 16.53 (9.19) = 149.94 crore (Actual = 121.6);
(Adj R² = 0.93, DW = 1.67)
Multiple Regression Equation With Two Independent Variables

Two variable model: $\hat{Y} = b_0 + b_1X_1 + b_2X_2$

[Figure: fitted plane in (X1, X2, Y) space.]
Residuals in Multiple Regression

Two variable model: $\hat{Y} = b_0 + b_1X_1 + b_2X_2$

[Figure: a sample observation Yi above the fitted plane at (x1i, x2i); Residual $e_i = (Y_i - \hat{Y}_i)$.]

The best fit equation is found by minimizing the sum of squared errors, Σe².
Multiple Linear Regression (MLR)

• Data on (X1, X2, …, Xk, Y):
• Y = a + b1X1 + b2X2 + … + bkXk + error
    = a*1 + b1*X1 + b2*X2 + … + bk*Xk + error

Minimize $\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)^2$
“Equations” to solve for estimating MLR

$$\frac{\partial SSE}{\partial a} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right) = 0 \quad (1)$$

$$\frac{\partial SSE}{\partial b_1} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)(x_{1i}) = 0$$

………

$$\frac{\partial SSE}{\partial b_k} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)(x_{ki}) = 0$$
Short-cut method for writing Equations to solve for estimating MLR

• Model: Y = a*1 + b1*X1 + b2*X2 + … + bk*Xk + error
• There are (k+1) explanatory variables: 1 (constant) and k nonconstant X-variables.
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by 1 throughout, then sum over i to get the 1st equation;
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by x1i throughout, then sum over i to get the 2nd equation;
………
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by xki throughout, then sum over i to get the (k+1)-th equation.
For k = 2, the equations are:

$$na + \left(\sum x_{1i}\right)b_1 + \left(\sum x_{2i}\right)b_2 = \sum y_i$$
$$\left(\sum x_{1i}\right)a + \left(\sum x_{1i}^2\right)b_1 + \left(\sum x_{2i}x_{1i}\right)b_2 = \sum y_i x_{1i}$$
$$\left(\sum x_{2i}\right)a + \left(\sum x_{1i}x_{2i}\right)b_1 + \left(\sum x_{2i}^2\right)b_2 = \sum y_i x_{2i}$$

• Given data, calculate Σyi, Σx1i, Σx2i, Σyix1i, Σx1i², Σx2ix1i, Σyix2i, Σx2i² and solve the 3 equations in 3 unknowns
• Excel (or any software like PHStat) solves these equations
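As an illustration outside Excel (my own sketch), these three equations are exactly XᵀX times the coefficient vector = Xᵀy, which NumPy solves directly; on the 14 Trent training quarters this should reproduce the MLR fit reported on the next slides (≈ −4.93, 2.55, 13.60):

```python
import numpy as np

# 14 Trent training quarters: Adv Exp, Personnel Cost, Net Sales
ae = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
               5.08, 11.18, 10.18, 11.03, 5.67])
pc = np.array([2.74, 3.27, 3.3, 3.41, 3.85, 4.1, 4.14, 4.89,
               5.22, 6.02, 6.64, 7.5, 7.11, 7.19])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Design matrix with columns 1, X1, X2; X'X collects the sums in the equations
X = np.column_stack([np.ones(len(y)), ae, pc])
A = X.T @ X              # n, Σx1, Σx2, Σx1², Σx2x1, Σx2², arranged as a 3x3
rhs = X.T @ y            # Σy, Σy*x1, Σy*x2
a, b1, b2 = np.linalg.solve(A, rhs)
print(a, b1, b2)         # ~ -4.93, 2.55, 13.60, matching the next slide
```

For larger k, np.linalg.lstsq on (X, y) gives the same answer with better numerical stability.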
Example (Trent Data):

[Regression output for the MLR of Net Sales on Adv Exp and Personnel Cost.]
Example (Trent): MLR Net Sales = −4.93 + 2.55 (Adv Exp) + 13.60 (Person. Cost), (Adj R² = 0.97, DW = 1.1). For June 2007, Predicted value = 144.42 crore (Actual = 121.6)

[Figure: Residuals vs predicted values for the MLR on AE and PC; DW = 1.1.]
Example (Trent Data): Try MLR of Net Sales on Personnel Cost, PC², and Adv Exp. (Adj R² = 0.978, DW = 1.64, Pred = 129) (Actual = 121.6)

[Figure: Residuals vs predicted values for the MLR on PC, PC², AE; DW = 1.64.]
Summary so far …

Regression using “Dummy Predictors”
(used for a Qualitative Predictor Variable)

• Examples: Gender, Location, Season, Quarter, Day of Week, Hour of Day, Month of Year
• No. of dummy or “0-1” or binary or indicator variables introduced
  = (no. of possible values of the Qualitative Variable) − 1
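A sketch (mine) of how the quarter dummies enter the design matrix. It assumes Time = 1 for Dec-03 and the fiscal-quarter coding Jun = Q1, Sep = Q2, Dec = Q3, Mar = Q4 (my assumption about the slides' labels); with the constant present, only Q1-Q3 are included and Q4 is the baseline, per Note 2 earlier:

```python
import numpy as np

# Quarter labels for the 14 training points: Dec-03 = Q3, Mar-04 = Q4,
# Jun-04 = Q1, ... (fiscal-quarter coding -- an assumption, not from the deck)
q = np.array([3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
time = np.arange(1, 15)                      # Time = 1 for Dec-03 (assumed)

# Three 0-1 dummies for Q1..Q3; Q4 is absorbed by the constant (the baseline)
Q1, Q2, Q3 = (q == 1) * 1.0, (q == 2) * 1.0, (q == 3) * 1.0
X = np.column_stack([np.ones(14), time, Q1, Q2, Q3])

y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # constant (Q4 level), time trend, Q1..Q3 shifts relative to Q4
```

In this parametrization, a slide-style quarter intercept equals the constant plus that quarter's shift; the DVR results below use the equivalent form with four quarter terms and no separate constant.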
Trent Data: Line plot of Net Sales over Time

[Figure: Line plot of quarterly Net Sales over time, trending upward from about 40 to about 120.]
DVR Results (Trent Data)

Net sales = 30.66 + 6.57*Time; (Adj R² = 0.94, DW = 1.86)

Net Sales = 6.65*Time + 28.76 (if Q1) + 33.78 (if Q2) + 35.71 (if Q3) + 22.5 (if Q4)
(Adj R² = 0.98, DW = 1.48)
Polynomial Regression

• Data: (x1,y1), (x2,y2), …, (xn,yn)
• Y = a + b1X + b2X² + … + bkX^k + error
• Minimize

$$SSE = \sum_{i=1}^{n}\left(y_i - a - b_1x_i - b_2x_i^2 - \dots - b_kx_i^k\right)^2$$
When Polynomial Regression??

[Figure: Two panels. Left: Y vs x shows curvature and the residuals vs x show a systematic pattern ⇒ Not Linear. Right: Y vs x is straight and the residuals vs x show no pattern ⇒ Linear.]
Fitting Polynomials (in X) to Data
(Y = a + b1X + b2X² + … + bkX^k + error)

• Consider polynomial regression for k = 2, 3, 4, etc.
• At what value of k do we stop? [when Adj R² starts decreasing; see the sketch below]
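A sketch (mine, not the deck's) of that stopping rule for the regression of Net Sales on Personnel Cost: fit polynomials of increasing degree and watch where the Adjusted R² stops improving:

```python
import numpy as np

def adjusted_r2(y, yhat, k):
    # Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1)), k = non-constant predictors
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# 14 Trent training quarters: Personnel Cost (pc) and Net Sales (y)
pc = np.array([2.74, 3.27, 3.3, 3.41, 3.85, 4.1, 4.14, 4.89,
               5.22, 6.02, 6.64, 7.5, 7.11, 7.19])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Fit polynomials of increasing degree k; stop once Adjusted R^2 starts falling
for k in range(1, 5):
    coefs = np.polyfit(pc, y, deg=k)       # least-squares polynomial fit
    yhat = np.polyval(coefs, pc)
    print(k, round(adjusted_r2(y, yhat, k), 3))
```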
Remember SLR on PC (Trent data): Net Sales = −2.007 + 16.53 (Person. Cost); for next quarter, predicted net sales = −2.007 + 16.53 (9.19) = 149.94 crore (Actual = 121.6)

[Figure: Residuals vs predicted values for the SLR on PC; DW = 1.67.]
Try Quadratic Reg in PC (Trent data): Net Sales = −44.12 + 34.47 (PC) − 1.73 (PC²); for next quarter, predicted net sales = −44.12 + 34.47 (9.19) − 1.73 (9.19)² = 126.93 crore (Actual = 121.6); Adj R² = 0.94

[Figure: Residuals vs predicted values for the quadratic regression in PC; DW = 2.23.]
Student’s t Distribution
(by William Sealy Gosset, pseudonym “Student”)

Z ~ N(0,1), Y ~ χ²_m, a chi-square r.v. with m df, Z and Y independent.
Then $Z/\sqrt{Y/m}$ has the t distribution with df = m.

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.

[Figure: t density vs the standard normal (t with df = ∞), both centered at 0.]
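A quick simulation sketch (mine) of this construction: t variates built as Z/√(Y/m) indeed show fatter tails than the standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5                                  # degrees of freedom
z = rng.standard_normal(100_000)       # Z ~ N(0,1)
w = rng.chisquare(m, 100_000)          # Y ~ chi-square with m df, indep. of Z
t = z / np.sqrt(w / m)                 # t variates with df = m

# 'Fatter' tails: proportion of draws beyond +/-3 is larger for t than for Z
print((np.abs(t) > 3).mean(), (np.abs(z) > 3).mean())
```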
F Distribution

Let Y1 and Y2 be independent r.v.s having χ² distributions with df = m1 and m2 respectively. Then (Y1/m1)/(Y2/m2) has the F distribution with df = (m1, m2).
