
Correlation & Regression Analysis

Sahadeb Sarkar
Operations Management Group, IIM Calcutta
Readings

Regression-Based Business Forecasting:
Covariance & Correlation Coefficient (Sec 3.5, 5.2), Simple Linear Regression (Sec 13.1-13.6), Multiple Linear Regression (Sec 14.1, 14.2, 14.6 (Dummy Var Reg), 15.1 (Polynomial Reg), 15.2)

Textbook: “Statistics for Managers using Microsoft Excel”, Levine, Stephan & Szabat, 8th ed.
Regression: Modeling “Effect” Y using “Cause(s)” or Proxy for Cause(s) X

Effect (Y) and “Cause” (X):
• (Market Share, Size of Quality Sales Force)
• (Sales, Adv Exp)
• (Amount of sales, No. of customers visiting store)
• (Closing Sensex next day, Closing DJIA (Dow Jones Industrial Average) Index)
• (Performance in Competitive Exam, I.Q.)
• (Gold Price, Rate of Inflation)
• (Percentage of Defectives in the output, Speed of Conveyor Belt)

Note: Y is known as the “Effect/Response/Target/Dependent” variable;
X is known as the “Predictor/Explanatory/Independent/‘Cause’/Proxy for ‘Cause’” variable.
“Multiple Causes-and-Effect” Model

• Sales of a company; Exp on Adv (TV, Radio, Print Media), Personnel Cost
• Sales of a store; Promotional Expenses, Discounts
• Performance evaluation; Aptitude, Job experience, MBA or not
• Car mileage; Horse power, Weight, Weather (temperature), Driving behavior
Simple Linear Regression Model

$$Y_i = a + bX_i + e_i$$

[Figure: For a given X_i, the observed value Y_i differs from the predicted value on the fitted line (intercept = a, slope = b) by the random error e_i.]
Simple Linear Regression: Assumptions

Assumptions of the Simple Linear Regression Model:
• The relationship between X and Y is a straight-line relationship: E[Y] = β0 + β1X.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term e_i.
• The errors e_i are normally distributed with mean 0 and variance σ². The errors are at least uncorrelated (not related) in successive observations. That is: e_i ~ N(0, σ²).

[Figure: Identical normal distributions of errors, all centered on the regression line.]

Suppose Y = Sales and X = AE (Adv Exp); we assume that for a given AE value the conditional distribution of possible sales values is normal, the expected (average) Sales changes as a linear function of AE, and the variation in Sales remains unchanged.
Trent

Trent is a retail operations company that owns and manages a number of retail chains in India. Established in 1998, Trent runs the lifestyle chain Westside, one of India’s largest and fastest growing chains of lifestyle retail stores; Star Bazaar, a hypermarket chain; Landmark, a books and music chain; and Fashion Yatra, a complete family fashion store.
Example: Data on Quarterly Advertising Expense (X) and Quarterly Net Sales (Y) for Trent from Dec 2003 to Mar 2008.

Note: The last few quarters form a Hold-Out or Validation Sample, set aside to check prediction ability.
Different Regression Models

Simple Linear Regression:
(1) Sales = a + b*AE, AE = Adv Exp
(2) Sales = a + b*PC, PC = Personnel Cost

Multiple Linear Regression:
(3) Sales = a + b1*AE + b2*PC
[Note 1: Adding (AE+PC) as a new predictor in the above model is not useful.]

Polynomial Regression:
(4a) Sales = a + b1*PC + b2*(PC)²
(4b) Sales = a + b1*AE + b2*(AE)²

Dummy Variable Regression:
(5) Sales = a*1 + b*Time + c1*Q1 + c2*Q2 + c3*Q3, where Q1, Q2, Q3 are indicator variables for quarters 1, 2, 3.
[Note 2: a = a*1; Q1+Q2+Q3+Q4 = 1 ⇒ Q4 = 1 − (Q1+Q2+Q3); adding Q4 in the above model as a predictor is not useful when the constant term a and Q1, Q2, Q3 are present.]
Summary

R² = proportion of variation in the response variable explained by the explanatory variables; Adj R² = Adjusted R² is an improved version of R²; DW = Durbin-Watson statistic to quantify the ‘degree of linear relationship’ among regression errors over time; MAPE = Mean (average) Absolute Percentage Error, where

APE (absolute percentage error) = |actual value − predicted value| / (actual value) × 100
Simple Linear Reg: Sales = a + b*AE

Data: (x1,y1), …, (xn,yn)

1. Is there a linear relationship between the variables Net Sales and Adv Exp? [construct a scatter plot of Y vs X]
2. If a linear relationship exists, how strong is it? [calculate the correlation coefficient between Y and X]
3. If the linear relationship is strong, give an estimated linear relationship of Net Sales on Adv Exp based on the data. [estimate the a, b coefficients through the ‘least squares’ of errors method]
4. How good is the fit? [calculate R² = proportion of variation in Y values explained by the variation in X values through the linear regression model]
[Figure: Line plot of quarterly Net Sales (left axis, 30–130) and Adv Exp (right axis, 2–12) for Trent over the sample quarters, Dec-03 onward.]
Trent data: Correlation = 0.78

[Scatter plot: Net Sales (20–140) vs Adv Exp (2–12).]
Correlation = 0.97

[Scatter plot: Net Sales (20–140) vs Personnel Cost (2–8).]
Measures of linear relationship: (Sample) Covariance

Definition: $\text{Covariance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

Shortcut Formula: $\frac{1}{n-1}\left[\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}\right]$

Note: When Y = X, Covariance(X,Y) is the Variance of X.

Shortcut Formula for sample variance:

$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}\right]$$

Note 1: $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}$

Note 2: $\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x}) = \sum_{i=1}^{n}x_i x_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}x_i)}{n} = \sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}$
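As a quick numerical check, here is a minimal NumPy sketch (mine, not from the deck) that computes the covariance both from the definition and from the shortcut formula, using the first five Trent quarters (Adv Exp, Net Sales) from the data table later in these slides:

```python
import numpy as np

# First five Trent quarters from the data table: Adv Exp (x), Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13])
n = len(x)

# Definition: (1/(n-1)) * sum of (xi - xbar)(yi - ybar)
cov_def = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Shortcut: (1/(n-1)) * [sum(xi*yi) - (sum xi)(sum yi)/n]
cov_short = (np.sum(x * y) - x.sum() * y.sum() / n) / (n - 1)

print(cov_def, cov_short)           # identical up to floating-point rounding
print(np.cov(x, y, ddof=1)[0, 1])   # NumPy's built-in covariance agrees
```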
Measures of linear relationship: (Sample) Covariance

$$\text{Covariance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

• Covariance is positive if above-average (below-average) X values usually occur together with above-average (below-average) Y values.
• Covariance depends on the units of X and Y. [The relation between GDP and Inflation and that between Price and Sales cannot be compared with covariance.]
• Its value ranges between −∞ and ∞; there is no finite benchmark as to how large is “large”.
Measure of linear relationship: (Sample) Correlation

$$r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{\text{Covariance}(x,y)}{\text{SD}(x)\,\text{SD}(y)}$$

(Recall Notes 1 and 2 from the covariance slide.)
Measure of linear relationship: (Sample) Correlation

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Note 1: $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}$

Note 2: $\sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}$

Hence, Short-cut Formula for Correlation:

$$r = \frac{\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}}{\sqrt{\left[\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}\right]\left[\sum_{i=1}^{n}y_i^2 - \frac{(\sum_{i=1}^{n}y_i)^2}{n}\right]}}$$
Measures of linear relationship: (Sample) Correlation

• Correlation is unit-free. (The relation between GDP and Inflation and that between Price and Sales can be compared with correlation.)
• Its value lies between −1 and 1; there is a finite benchmark as to how large is “large”.
Calculation of r

Note: For calculation of the variance, covariance, correlation and almost all of the regression-related statistics we do not require the raw data {(xi, yi)}; we just need to know:

$$n,\quad \sum_{i=1}^{n}x_i,\quad \sum_{i=1}^{n}x_i^2,\quad \sum_{i=1}^{n}y_i,\quad \sum_{i=1}^{n}y_i^2,\quad \sum_{i=1}^{n}x_i y_i$$
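To illustrate this point (a sketch of mine, not the deck's), r can be computed from those six sums alone; on the 14 Trent training quarters it reproduces the r ≈ 0.78 reported earlier:

```python
import numpy as np

def corr_from_sums(n, sx, sxx, sy, syy, sxy):
    # Shortcut formula for r, built from the six sufficient statistics only
    num = sxy - sx * sy / n
    den = np.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return num / den

# 14 Trent training quarters: Adv Exp (x) and Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
              5.08, 11.18, 10.18, 11.03, 5.67])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

r = corr_from_sums(len(x), x.sum(), (x**2).sum(),
                   y.sum(), (y**2).sum(), (x * y).sum())
print(round(r, 2), round(np.corrcoef(x, y)[0, 1], 2))  # both ~0.78
```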
Correlation Analysis for Trent Data

[Table: covariances and correlations of Net Sales with Adv Exp and with Personnel Cost.]

Note: Going by the covariance as a measure of linear relationship can be misleading, as seen from this example. Here, covariance suggests Adv Exp is a better linear predictor of sales than Personnel Cost; but common sense and the scatterplots lead us to believe otherwise.
Scatter plots showing positive and negative correlation

[Figure: example scatter plots.]

(Nearly) Zero Correlation

[Figure: example scatter plot.]
Interpretation of Correlation values

• Zero correlation does not mean no relationship. [There may be a nonlinear relationship; correlation measures linear relationship.]
• Similarly, high correlation does not mean a causal relationship between the variables. [There could be hidden variable(s) influencing both the Y and X variables.]

Example 1: X = No. of firemen fighting a fire, Y = damage from fire (correlation usually positive; Hidden Factor: Severity of fire)
Example 2: X = Computer Sales, Y = Demand for Dental Care (correlation usually positive; Hidden Factor: GDP growth)
Example 1: Quarterly Advertising Expense (X) and Quarterly Net Sales (Y) for Trent from Dec-03 to Mar-07. Predict sales for the Jun-07 quarter if (approx) 9.58 crores are earmarked for Adv Exp for Jun-07.
Simple linear regression

• Model: Yi = (a + bXi) + ei, i = 1, 2, …, n
• “error” ei = part of Yi unexplained
  – the true relation between Y and X may be nonlinear
  – Y depends not just on X but on other unincluded variables
Example: Y = sales, X = adv. exp. (other explanatory variables: personnel cost, price, GDP growth)

Least Squares Method: Minimize the sum of ei² over all time periods, i.e.,
Minimize SSE = Σi (yi − a − bxi)², w.r.t. a, b
Simple Linear Regression Model: Yi = (a*1 + b*Xi) + ei.

There are two explanatory variables: 1 (constant) and X (nonconstant).
• Multiply yi = (a*1 + b*xi) by 1 throughout, then sum over i to get equation (1) below;
• Multiply yi = (a*1 + b*xi) by xi throughout, then sum over i to get equation (2) below.

$$na + b\sum_{i=1}^{n}x_i = \sum_{i=1}^{n}y_i \quad (1)$$
$$a\sum_{i=1}^{n}x_i + b\sum_{i=1}^{n}x_i^2 = \sum_{i=1}^{n}x_i y_i \quad (2)$$

i.e., in matrix notation we have

$$\begin{pmatrix} n & \sum_{i=1}^{n}x_i \\ \sum_{i=1}^{n}x_i & \sum_{i=1}^{n}x_i^2 \end{pmatrix}\begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n}y_i \\ \sum_{i=1}^{n}x_i y_i \end{pmatrix}$$
Estimates of b, a

$$\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = r_{y,x}\left(\frac{s_y}{s_x}\right),$$
where $s_y$ and $s_x$ are the sample standard deviations of the y and x variables.

$$\hat{a} = \bar{y} - \hat{b}\bar{x}$$
Predicted value of y $= \hat{a} + \hat{b}x = \bar{y} + \hat{b}(x-\bar{x})$

Note: We write the estimates as $\hat{a}$ (a-hat), $\hat{b}$ (b-hat), and $\hat{y}$ (y-hat); for X = xi we write the predicted value of y as $\hat{y}_i$.
Short-Cut Formula: Estimates of b, a

$$\hat{b} = \frac{\sum_{i=1}^{n}x_i y_i - \frac{(\sum_{i=1}^{n}x_i)(\sum_{i=1}^{n}y_i)}{n}}{\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}} \quad \text{(by Notes 1 and 2)}$$

$$\hat{a} = \bar{y} - \hat{b}\bar{x}$$
Predicted value of y $= \hat{a} + \hat{b}x = \bar{y} + \hat{b}(x-\bar{x})$

Note: We write the estimates as $\hat{a}$ (a-hat), $\hat{b}$ (b-hat), and $\hat{y}$ (y-hat); for X = xi we write the predicted value of y as $\hat{y}_i$.
Estimated Reg line (Trent data): Net Sales = 27.82 + 7.59 (Adv Exp)
(i) Interpretation of $\hat{b}$ (b-hat), $\hat{a}$ (a-hat)??
(ii) For next quarter Jun-07, predicted net sales = 27.82 + 7.59*(9.58) = 100.56 crore (actual sales = 121.6)
How Good is the Regression Fit?

• Total variation in y-values is divided in two parts:

$$SST = SSR + SSE$$

Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

$$SST = \sum(Y_i-\bar{Y})^2 \qquad SSR = \sum(\hat{Y}_i-\bar{Y})^2 \qquad SSE = \sum(Y_i-\hat{Y}_i)^2$$

where:
$\bar{Y}$ = average value of the dependent variable
$Y_i$ = observed values of the dependent variable
$\hat{Y}_i$ = predicted value of Y for the given Xi value
Measures of Variation

[Figure: For each observation, the deviation $(Y_i-\bar{Y})$ splits into $(\hat{Y}_i-\bar{Y})$ and $(Y_i-\hat{Y}_i)$; summing squares gives SST = $\sum(Y_i-\bar{Y})^2$, SSR = $\sum(\hat{Y}_i-\bar{Y})^2$, SSE = $\sum(Y_i-\hat{Y}_i)^2$.]
R² = proportion of variation in Y explained by variation in explanatory variable(s) through the regression relation

Definition: $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

Definition: $\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$
where k = number of non-constant predictors and ‘1’ refers to the presence of a constant predictor.

Note: R² involves the ratio of SSE (total amount of variation in errors) to SST (total amount of variation in response Y values);
Adjusted R² involves the ratio of SSE/(n−k−1) (i.e., ‘average’ amount of variation in errors) to SST/(n−1) (i.e., ‘average’ amount of variation in response Y values).

Verify: $\text{Adjusted } R^2 = 1 - (1-R^2)\frac{(n-1)}{(n-k-1)}$
R² (‘Coefficient of Determination’) & Adjusted R²

R² (Coefficient of Determination): proportion of variation in Y explained by variation in explanatory variable(s) through the regression relation

Adjusted R²: modified version of R² which penalizes a model for including redundant explanatory variables and takes into account sample size

(i) 0 ≤ R² ≤ 1; (ii) Adjusted R² ≤ R²; (iii) Adjusted R² may be negative.
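The following sketch (mine, not the deck's) computes both measures for the SLR of Net Sales on Adv Exp; with k = 1 non-constant predictor it should land near the R² = 0.61 and Adj R² = 0.58 reported later for this fit:

```python
import numpy as np

def r2_and_adjusted(y, yhat, k):
    # R^2 = 1 - SSE/SST; Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1))
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst, 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# 14 Trent training quarters: Adv Exp (x) and Net Sales (y)
x = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
              5.08, 11.18, 10.18, 11.03, 5.67])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Least-squares slope and intercept (see the estimates slide)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(r2_and_adjusted(y, a + b * x, k=1))  # ~ (0.61, 0.58)
```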
Are Regression Errors “independent”?
(Trent Data: SLR of Net Sales on Adv Exp)

[Figure: Residuals (Res) vs predicted values (Pred) for the SLR on Adv Exp; DW = 1.33.]
Durbin-Watson (DW) Statistic

• DW = sum of squares of (residual − previous residual), divided by sum of squares of residuals.
• DW ≈ 2 * (1 − r), where r = correlation between current residual and previous residual
• DW ≈ 2 indicates errors are serially uncorrelated
• 1.5 ≤ DW ≤ 2.5 may be acceptable (range of DW is 0 to 4)
Durbin-Watson (DW) Statistic

$$DW = \frac{\sum_{i=2}^{n}(e_i-e_{i-1})^2}{\sum_{i=1}^{n}e_i^2} = \frac{\sum_{i=2}^{n}e_i^2}{\sum_{i=1}^{n}e_i^2} + \frac{\sum_{i=2}^{n}e_{i-1}^2}{\sum_{i=1}^{n}e_i^2} - \frac{\sum_{i=2}^{n}2e_i e_{i-1}}{\sum_{i=1}^{n}e_i^2}$$

$$\approx 1 + 1 - 2\,\frac{\sum_{i=2}^{n}(e_i-\bar{e})(e_{i-1}-\bar{e})}{\sqrt{\sum_{i=1}^{n}(e_i-\bar{e})^2}\,\sqrt{\sum_{i=1}^{n}(e_{i-1}-\bar{e})^2}} \approx 2\,(1 - \text{Corr}(e_i, e_{i-1}))$$

• Hence, DW = sum of squares of successive differences in residuals divided by sum of squares of residuals ≈ 2(1 − correlation between successive regression errors ei and ei−1)
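A minimal numerical check (my own sketch): compute DW directly and compare it with the 2(1 − r) approximation on simulated serially uncorrelated residuals; both come out near 2:

```python
import numpy as np

def durbin_watson(e):
    # Sum of squared successive differences divided by sum of squared residuals
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
e = rng.normal(size=200)                   # illustrative uncorrelated residuals
r = np.corrcoef(e[1:], e[:-1])[0, 1]       # lag-1 sample correlation
print(durbin_watson(e), 2 * (1 - r))       # both near 2, as the slide claims
```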
Random Pattern, as shown in graph below, indicates Regression Errors “Independent”

Horizontal rectangular pattern ⇒ ‘Random’ pattern

[Figure: residuals scattered in a horizontal band with no visible pattern.]
Estimating Error Variance, σ², in Simple Linear Regression: Yi = a + b*Xi + ei

Building blocks of SSE are the residuals $e_i = (Y_i - \hat{Y}_i)$, which satisfy two restrictions; thus only (n−2) of the residuals can take values freely but the remaining two cannot. Hence the ‘degrees of freedom’ is (n−2) and we divide by (n−2) to calculate the ‘average’ of the squared errors in the definition MSE = SSE/(n−2), which estimates σ².
How good is SLR of Net Sales on Adv Exp?

• Trent Data: Net Sales = 27.82 + 7.59 (Adv Exp); (R² = 0.61, Adj R² = 0.58, DW = 1.33)

MAPE = Mean Absolute Percentage Error
= (1/4)(17.29 + 8.38 + 16.90 + 52.75) = 23.83
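For concreteness, a tiny sketch (mine) of the APE/MAPE calculation; applied to the Jun-07 hold-out point it reproduces the first APE above, |121.6 − 100.56|/121.6 × 100 ≈ 17.3:

```python
import numpy as np

def mape(actual, predicted):
    # Mean of APE = |actual - predicted| / actual * 100 over the hold-out points
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual * 100)

print(round(mape([121.6], [100.56]), 2))  # ~17.3, the Jun-07 APE on this slide
```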
Example 2 (Trent data): Personnel Cost in place of Adv Exp. Predict sales for Jun-07 if (approx) 9.19 crores are earmarked for Personnel Cost.

Quarter   Advertising expenses   Personnel cost   Net sales
Dec-03    4.3                    2.74             42.02
Mar-04    2.7                    3.27             40.35
Jun-04    4.59                   3.3              46.89
Sep-04    4.54                   3.41             54.1
Dec-04    8.87                   3.85             70.13
Mar-05    3.17                   4.1              63.37
Jun-05    7.44                   4.14             74.01
Sep-05    8.33                   4.89             86.63
Dec-05    9                      5.22             95.17
Mar-06    5.08                   6.02             90.63
Jun-06    11.18                  6.64             105.09
Sep-06    10.18                  7.5              120.29
Dec-06    11.03                  7.11             121.83
Mar-07    5.67                   7.19             108.58

Jun-07    9.58                   9.19             121.6    (hold-out quarter)
Estimated Reg line (Trent data): Net Sales = −2.007 + 16.53 (Person. Cost).
June 2007 quarter: predicted net sales = −2.007 + 16.53 (9.19) = 149.94 crore (Actual = 121.6);
(Adj R² = 0.93, DW = 1.67)
Multiple Regression Equation With Two Independent Variables

Two variable model: $\hat{Y} = b_0 + b_1X_1 + b_2X_2$

[Figure: fitted plane in (X1, X2, Y) space.]
Residuals in Multiple Regression

Two variable model: $\hat{Y} = b_0 + b_1X_1 + b_2X_2$

[Figure: a sample observation Yi above the fitted plane at (x1i, x2i); Residual $e_i = (Y_i - \hat{Y}_i)$.]

The best fit equation is found by minimizing the sum of squared errors, Σe².
Multiple Linear Regression (MLR)

• Data on (X1, X2, …, Xk, Y):
• Y = a + b1X1 + b2X2 + … + bkXk + error
    = a*1 + b1*X1 + b2*X2 + … + bk*Xk + error

Minimize $\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)^2$
“Equations” to solve for estimating MLR

$$\frac{\partial SSE}{\partial a} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right) = 0 \quad (1)$$

$$\frac{\partial SSE}{\partial b_1} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)(x_{1i}) = 0$$

………

$$\frac{\partial SSE}{\partial b_k} = -2\sum_{i=1}^{n}\left(y_i - a - b_1x_{1i} - b_2x_{2i} - \dots - b_kx_{ki}\right)(x_{ki}) = 0$$
Short-cut method for writing Equations to solve for estimating MLR

• Model: Y = a*1 + b1*X1 + b2*X2 + … + bk*Xk + error
• There are (k+1) explanatory variables: 1 (constant) and k nonconstant X-variables.
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by 1 throughout, then sum over i to get the 1st equation;
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by x1i throughout, then sum over i to get the 2nd equation;
………
• Multiply yi = (a*1 + b1*x1i + … + bk*xki) by xki throughout, then sum over i to get the (k+1)-th equation.
For k = 2, the equations are:

$$na + \left(\sum x_{1i}\right)b_1 + \left(\sum x_{2i}\right)b_2 = \sum y_i$$
$$\left(\sum x_{1i}\right)a + \left(\sum x_{1i}^2\right)b_1 + \left(\sum x_{2i}x_{1i}\right)b_2 = \sum y_i x_{1i}$$
$$\left(\sum x_{2i}\right)a + \left(\sum x_{1i}x_{2i}\right)b_1 + \left(\sum x_{2i}^2\right)b_2 = \sum y_i x_{2i}$$

• Given data, calculate Σyi, Σx1i, Σx2i, Σyix1i, Σx1i², Σx2ix1i, Σyix2i, Σx2i² and solve the 3 equations in 3 unknowns
• Excel (or any software like PHStat) solves these equations
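As an illustration outside Excel (my own sketch), these three equations are exactly XᵀX times the coefficient vector = Xᵀy, which NumPy solves directly; on the 14 Trent training quarters this should reproduce the MLR fit reported on the next slides (≈ −4.93, 2.55, 13.60):

```python
import numpy as np

# 14 Trent training quarters: Adv Exp, Personnel Cost, Net Sales
ae = np.array([4.3, 2.7, 4.59, 4.54, 8.87, 3.17, 7.44, 8.33, 9.0,
               5.08, 11.18, 10.18, 11.03, 5.67])
pc = np.array([2.74, 3.27, 3.3, 3.41, 3.85, 4.1, 4.14, 4.89,
               5.22, 6.02, 6.64, 7.5, 7.11, 7.19])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Design matrix with columns 1, X1, X2; X'X collects the sums in the equations
X = np.column_stack([np.ones(len(y)), ae, pc])
A = X.T @ X              # n, Σx1, Σx2, Σx1², Σx2x1, Σx2², arranged as a 3x3
rhs = X.T @ y            # Σy, Σy*x1, Σy*x2
a, b1, b2 = np.linalg.solve(A, rhs)
print(a, b1, b2)         # ~ -4.93, 2.55, 13.60, matching the next slide
```

For larger k, np.linalg.lstsq on (X, y) gives the same answer with better numerical stability.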
Example (Trent Data):

[Regression output for the MLR of Net Sales on Adv Exp and Personnel Cost.]
Example (Trent): MLR Net Sales = −4.93 + 2.55 (Adv Exp) + 13.60 (Person. Cost), (Adj R² = 0.97, DW = 1.1). For June 2007, Predicted value = 144.42 crore (Actual = 121.6)

[Figure: Residuals vs predicted values for the MLR on AE and PC; DW = 1.1.]
Example (Trent Data): Try MLR of Net Sales on Personnel Cost, PC², and Adv Exp. (Adj R² = 0.978, DW = 1.64, Pred = 129) (Actual = 121.6)

[Figure: Residuals vs predicted values for the MLR on PC, PC², AE; DW = 1.64.]
Summary so far …

Regression using “Dummy Predictors”
(used for a Qualitative Predictor Variable)

• Examples: Gender, Location, Season, Quarter, Day of Week, Hour of Day, Month of Year
• No. of dummy or “0-1” or binary or indicator variables introduced
  = (no. of possible values of the Qualitative Variable) − 1
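A sketch (mine) of how the quarter dummies enter the design matrix. It assumes Time = 1 for Dec-03 and the fiscal-quarter coding Jun = Q1, Sep = Q2, Dec = Q3, Mar = Q4 (my assumption about the slides' labels); with the constant present, only Q1-Q3 are included and Q4 is the baseline, per Note 2 earlier:

```python
import numpy as np

# Quarter labels for the 14 training points: Dec-03 = Q3, Mar-04 = Q4,
# Jun-04 = Q1, ... (fiscal-quarter coding -- an assumption, not from the deck)
q = np.array([3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
time = np.arange(1, 15)                      # Time = 1 for Dec-03 (assumed)

# Three 0-1 dummies for Q1..Q3; Q4 is absorbed by the constant (the baseline)
Q1, Q2, Q3 = (q == 1) * 1.0, (q == 2) * 1.0, (q == 3) * 1.0
X = np.column_stack([np.ones(14), time, Q1, Q2, Q3])

y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # constant (Q4 level), time trend, Q1..Q3 shifts relative to Q4
```

In this parametrization, a slide-style quarter intercept equals the constant plus that quarter's shift; the DVR results below use the equivalent form with four quarter terms and no separate constant.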
Trent Data: Line plot of Net Sales over Time

[Figure: Line plot of quarterly Net Sales over time, trending upward from about 40 to about 120.]
DVR Results (Trent Data)

Net sales = 30.66 + 6.57*Time; (Adj R² = 0.94, DW = 1.86)

Net Sales = 6.65*Time + 28.76 (if Q1) + 33.78 (if Q2) + 35.71 (if Q3) + 22.5 (if Q4)
(Adj R² = 0.98, DW = 1.48)
Polynomial Regression

• Data: (x1,y1), (x2,y2), …, (xn,yn)
• Y = a + b1X + b2X² + … + bkX^k + error
• Minimize

$$SSE = \sum_{i=1}^{n}\left(y_i - a - b_1x_i - b_2x_i^2 - \dots - b_kx_i^k\right)^2$$
When Polynomial Regression??

[Figure: Two panels. Left: Y vs x shows curvature and the residuals vs x show a systematic pattern ⇒ Not Linear. Right: Y vs x is straight and the residuals vs x show no pattern ⇒ Linear.]
Fitting Polynomials (in X) to Data
(Y = a + b1X + b2X² + … + bkX^k + error)

• Consider polynomial regression for k = 2, 3, 4, etc.
• At what value of k do we stop? [when Adj R² starts decreasing; see the sketch below]
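A sketch (mine, not the deck's) of that stopping rule for the regression of Net Sales on Personnel Cost: fit polynomials of increasing degree and watch where the Adjusted R² stops improving:

```python
import numpy as np

def adjusted_r2(y, yhat, k):
    # Adjusted R^2 = 1 - (SSE/(n-k-1)) / (SST/(n-1)), k = non-constant predictors
    n = len(y)
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# 14 Trent training quarters: Personnel Cost (pc) and Net Sales (y)
pc = np.array([2.74, 3.27, 3.3, 3.41, 3.85, 4.1, 4.14, 4.89,
               5.22, 6.02, 6.64, 7.5, 7.11, 7.19])
y = np.array([42.02, 40.35, 46.89, 54.1, 70.13, 63.37, 74.01, 86.63,
              95.17, 90.63, 105.09, 120.29, 121.83, 108.58])

# Fit polynomials of increasing degree k; stop once Adjusted R^2 starts falling
for k in range(1, 5):
    coefs = np.polyfit(pc, y, deg=k)       # least-squares polynomial fit
    yhat = np.polyval(coefs, pc)
    print(k, round(adjusted_r2(y, yhat, k), 3))
```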
Remember SLR on PC (Trent data): Net Sales = −2.007 + 16.53 (Person. Cost); for next quarter, predicted net sales = −2.007 + 16.53 (9.19) = 149.94 crore (Actual = 121.6)

[Figure: Residuals vs predicted values for the SLR on PC; DW = 1.67.]
Try Quadratic Reg in PC (Trent data): Net Sales = −44.12 + 34.47 (PC) − 1.73 (PC²); for next quarter, predicted net sales = −44.12 + 34.47 (9.19) − 1.73 (9.19)² = 126.93 crore (Actual = 121.6); Adj R² = 0.94

[Figure: Residuals vs predicted values for the quadratic regression in PC; DW = 2.23.]
Student’s t Distribution
(by William Sealy Gosset, pseudonym “Student”)

Z ~ N(0,1), Y ~ χ²_m, a chi-square r.v. with m df, Z and Y independent.
Then $Z/\sqrt{Y/m}$ has the t distribution with df = m.

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.

[Figure: t density vs the standard normal (t with df = ∞), both centered at 0.]
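A quick simulation sketch (mine) of this construction: t variates built as Z/√(Y/m) indeed show fatter tails than the standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5                                  # degrees of freedom
z = rng.standard_normal(100_000)       # Z ~ N(0,1)
w = rng.chisquare(m, 100_000)          # Y ~ chi-square with m df, indep. of Z
t = z / np.sqrt(w / m)                 # t variates with df = m

# 'Fatter' tails: proportion of draws beyond +/-3 is larger for t than for Z
print((np.abs(t) > 3).mean(), (np.abs(z) > 3).mean())
```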
F Distribution

Let Y1 and Y2 be independent r.v.s having χ² distributions with df = m1 and m2 respectively. Then (Y1/m1)/(Y2/m2) has the F distribution with df = (m1, m2).
