
Advanced Programme in FinTech and

Financial Blockchain

Data Analysis and Interpretation


(Regression)

Prof. Saibal Chattopadhyay


IIM Calcutta
Regression
• Regression refers to the statistical technique of modeling the
relationship between variables.
• Simple linear regression: we model the relationship between
two variables.
• One of the variables, denoted by Y, is called the dependent
variable and the other, denoted by X, is called the independent
variable.
• The model we will use to depict the relationship between X and
Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
Scatterplot
This scatterplot locates pairs of observations of advertising expenditures (X) on
the x-axis and sales (Y) on the y-axis. We notice that:

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y).]

• Larger (smaller) values of sales tend to be associated with larger (smaller)
  values of advertising.
• The scatter of points tends to be distributed around a positively sloped
  straight line.
• The pairs of values of advertising expenditures and sales are not located
  exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise
  linear relationship.
• The line represents the nature of the relationship on average.
Examples of Other Scatterplots

[Figure: Examples of other scatterplots, showing various possible patterns of
association between X and Y.]
Model Building
The inexact nature of the relationship between advertising and sales suggests
that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from
the random component:

    Data = Statistical model (Systematic component + Random errors)

In ANOVA, the systematic component is the variation of means between samples or
treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and
the random component is the variation around the line.
The Simple Linear Regression
Model
The population simple linear regression model:

    Y = β0 + β1X + ε
        (non-random/systematic part: β0 + β1X; random error: ε)

where
• Y is the dependent variable, the variable we wish to explain or predict;
• X is the independent variable, also called the predictor variable;
• ε is the error term, the only random component in the model, and thus the
  only source of randomness in Y;
• β0 is the intercept of the systematic component of the regression relationship;
• β1 is the slope of the systematic component.

The conditional mean of Y:  E[Y | X] = β0 + β1X
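To make the roles of the systematic and random components concrete, here is a
minimal simulation sketch in Python; the parameter values (β0 = 2, β1 = 0.5,
σ = 1) and the sample size are illustrative assumptions, not values from these
slides.

    import numpy as np

    rng = np.random.default_rng(42)

    # Assumed (illustrative) population parameters for Y = b0 + b1*X + eps
    beta0, beta1, sigma = 2.0, 0.5, 1.0

    x = np.linspace(0, 50, 100)                 # fixed (non-random) values of X
    eps = rng.normal(0.0, sigma, size=x.size)   # errors ~ N(0, sigma^2), uncorrelated
    y = beta0 + beta1 * x + eps                 # observed Y = systematic part + error

    expected_y = beta0 + beta1 * x              # conditional mean E[Y | X]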

Picturing the Simple Linear
Regression Model
[Figure: Regression plot showing the population regression line E[Y] = β0 + β1X,
with β0 as the intercept, β1 as the slope, and an observed point Yi lying off the
line by an error εi.]

The simple linear regression model gives an exact linear relationship between the
expected or average value of Y, the dependent variable, and X, the independent or
predictor variable:

    E[Yi] = β0 + β1Xi

Actual observed values of Y differ from the expected value by an unexplained or
random error:

    Yi = E[Yi] + εi = β0 + β1Xi + εi
Assumptions of SLR Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the
  only randomness in the values of Y comes from the error term εi.
• The errors εi are normally distributed with mean 0 and variance σ², and are
  uncorrelated (not related) in successive observations. That is: εi ~ N(0, σ²).

[Figure: Assumptions of the Simple Linear Regression Model: identical normal
distributions of errors, all centered on the regression line E[Y] = β0 + β1X.]
Estimation: Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated
or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

    Y = b0 + b1X + e

where b0 estimates the intercept of the population regression line, β0;
b1 estimates the slope of the population regression line, β1;
and e stands for the observed errors, the residuals from fitting the estimated
regression line b0 + b1X to a set of n points.

The estimated regression line:

    Ŷ = b0 + b1X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given
value of X.
Errors in Regression
[Figure: The fitted regression line Ŷ = b0 + b1X, with an observed point Yi, its
predicted value Ŷi, and the error ei between them.]

    Ŷ = b0 + b1X    is the fitted regression line;
    Ŷi              is the predicted value of Y for Xi;
    ei = Yi - Ŷi    is the error (residual) for the i-th observation.
Least Squares Regression
The sum of squared errors in regression is:

    SSE = Σ ei² = Σ (yi - ŷi)²,   summing over i = 1, ..., n.

The least squares regression line is the one that minimizes the SSE with respect
to the estimates b0 and b1.

The normal equations:

    Σ yi   = n·b0 + b1·Σ xi
    Σ xiyi = b0·Σ xi + b1·Σ xi²
Least Squares Estimators

Sums of squares and cross products:

    SSx  = Σ (x - x̄)²        = Σ x² - (Σ x)² / n
    SSy  = Σ (y - ȳ)²        = Σ y² - (Σ y)² / n
    SSxy = Σ (x - x̄)(y - ȳ)  = Σ xy - (Σ x)(Σ y) / n

Least-squares regression estimators:

    b1 = SSxy / SSx
    b0 = ȳ - b1·x̄
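The formulas above translate directly into code. The short Python sketch below
(the function name is my own, purely for illustration) computes the least-squares
estimates from a pair of data arrays:

    import numpy as np

    def least_squares_estimates(x, y):
        """Return (b0, b1) for the simple linear regression of y on x."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        ss_x  = np.sum(x**2) - np.sum(x)**2 / n             # SSx
        ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n   # SSxy
        b1 = ss_xy / ss_x                                   # slope
        b0 = y.mean() - b1 * x.mean()                       # intercept
        return b0, b1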
Example 10-1: Aczel & Sounderpandian

• Amex belief: its cardholders tend to travel more extensively than others.
• A market research survey was conducted to determine the travel (in miles, say X)
  and the charges (in dollars, say Y) on the Amex card.
• 25 cardholders were surveyed through a mailed questionnaire.
• Goal: to study the appropriate regression of Y on X.
Example 10-1: Aczel & Sounderpandian (Page 436)
Miles (x)   Dollars (y)   Miles²         Miles × Dollars
1211        1802           1,466,521       2,182,222
1345        2405           1,809,025       3,234,725
1422        2005           2,022,084       2,851,110
1687        2511           2,845,969       4,236,057
1849        2332           3,418,801       4,311,868
2026        2305           4,104,676       4,669,930
2133        3016           4,549,689       6,433,128
2253        3385           5,076,009       7,626,405
2400        3090           5,760,000       7,416,000
2468        3694           6,091,024       9,116,792
2699        3371           7,284,601       9,098,329
2806        3998           7,873,636      11,218,388
3082        3555           9,498,724      10,956,510
3209        4692          10,297,681      15,056,628
3466        4244          12,013,156      14,709,704
3643        5298          13,271,449      19,300,614
3852        4801          14,837,904      18,493,452
4033        5147          16,265,089      20,757,852
4267        5738          18,207,288      24,484,046
4498        6420          20,232,004      28,877,160
4533        6059          20,548,088      27,465,448
4804        6426          23,078,416      30,870,504
5090        6321          25,908,100      32,173,890
5233        7026          27,384,288      36,767,056
5439        6964          29,582,720      37,877,196
------------------------------------------------------
79,448      106,605      293,426,946     390,185,014

    SSx  = Σ x² - (Σ x)²/n     = 293,426,946 - (79,448)²/25           = 40,947,557.84
    SSxy = Σ xy - (Σ x)(Σ y)/n = 390,185,014 - (79,448)(106,605)/25   = 51,402,852.4

    b1 = SSxy / SSx = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26

    b0 = ȳ - b1·x̄ = 106,605/25 - (1.255333776)(79,448/25) = 274.85
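As a check, the same estimates can be reproduced numerically. The Python sketch
below uses the 25 (miles, dollars) pairs from the table above; the variable names
are illustrative.

    import numpy as np

    # Example 10-1 data: miles travelled (x) and Amex charges in dollars (y)
    miles = np.array([1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
                      2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
                      4533, 4804, 5090, 5233, 5439], dtype=float)
    dollars = np.array([1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
                        3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
                        6059, 6426, 6321, 7026, 6964], dtype=float)

    n = miles.size
    ss_x  = np.sum(miles**2) - np.sum(miles)**2 / n                         # ≈ 40,947,557.84
    ss_xy = np.sum(miles * dollars) - np.sum(miles) * np.sum(dollars) / n   # ≈ 51,402,852.4

    b1 = ss_xy / ss_x                         # ≈ 1.2553
    b0 = dollars.mean() - b1 * miles.mean()   # ≈ 274.85
    print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")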
[Figure: Scatterplot of the data with the fitted least-squares line.]

[Figure: Residual plot of x (Miles) versus the residuals (y - fitted y), where
fitted y = 274.85 + 1.26x.]

Residual analysis: the plot shows the absence of any systematic relationship
between the residuals and the X-values (miles).
Error Variance and the Standard Errors of
Regression Estimators
Degrees of freedom in regression:

    df = n - 2   (n total observations, less one degree of freedom for each
                  parameter estimated, b0 and b1)

Square and sum all regression errors to find SSE:

    SSE = Σ (Y - Ŷ)² = SSy - (SSxy)²/SSx = SSy - b1·SSxy

An unbiased estimator of σ², denoted by s², is the mean square error:

    MSE = SSE / (n - 2)

Example 10-1:

    SSE = SSy - b1·SSxy = 66,855,898 - (1.255333776)(51,402,852.4) = 2,328,161.2
    MSE = SSE / (n - 2) = 2,328,161.2 / 23 = 101,224.4
    s = √MSE = √101,224.4 = 318.158
Standard Errors of Estimates in
Regression
The standard error of b0 (intercept):

    s(b0) = s · √( Σ x² / (n·SSx) ),   where s = √MSE

The standard error of b1 (slope):

    s(b1) = s / √SSx

Example 10-1:

    s(b0) = 318.158 · √( 293,426,946 / ((25)(40,947,557.84)) ) = 170.338
    s(b1) = 318.158 / √40,947,557.84 = 0.04972
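Continuing the numerical sketch (using the Example 10-1 quantities computed
above; variable names are again illustrative), the standard errors follow
directly from these formulas:

    import numpy as np

    s      = 318.158            # standard error of the regression, sqrt(MSE)
    ss_x   = 40_947_557.84      # SSx
    sum_x2 = 293_426_946.0      # sum of x squared
    n      = 25

    se_b0 = s * np.sqrt(sum_x2 / (n * ss_x))   # ≈ 170.34
    se_b1 = s / np.sqrt(ss_x)                  # ≈ 0.0497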
Confidence Intervals for the
Regression Parameters
A (1 - α)100% confidence interval for β0:

    b0 ± t(α/2, n-2) · s(b0)

A (1 - α)100% confidence interval for β1:

    b1 ± t(α/2, n-2) · s(b1)

Example 10-1, 95% confidence intervals:

    b0 ± t(0.025, 23) · s(b0) = 274.85 ± (2.069)(170.338)
                              = 274.85 ± 352.43 = [-77.58, 627.28]

    b1 ± t(0.025, 23) · s(b1) = 1.25533 ± (2.069)(0.04972)
                              = 1.25533 ± 0.10287 = [1.15246, 1.35820]
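The same intervals can be obtained in Python using the t quantile from scipy (a
sketch assuming scipy is available; the numbers are those of Example 10-1):

    from scipy import stats

    b0, se_b0 = 274.85, 170.338
    b1, se_b1 = 1.25533, 0.04972
    n = 25

    t_crit = stats.t.ppf(0.975, df=n - 2)   # ≈ 2.069 for 23 degrees of freedom

    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)   # ≈ (-77.6, 627.3)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # ≈ (1.152, 1.358)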
Hypothesis Tests for the
Regression Slope
    H0: β1 = 0
    H1: β1 ≠ 0

Test statistic (with n - 2 degrees of freedom):

    t(n-2) = b1 / s(b1)

Example 10-1:

    t = 1.25533 / 0.04972 = 25.25
    t(0.005, 23) = 2.807 < 25.25

H0 is rejected at the 1% level, and we may conclude that there is a relationship
between charges and miles traveled.
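A sketch of the corresponding two-sided test in Python, using the Example 10-1
numbers (scipy assumed available):

    from scipy import stats

    b1, se_b1, n = 1.25533, 0.04972, 25

    t_stat = b1 / se_b1                               # ≈ 25.25
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value, ≈ 0

    # Reject H0: beta1 = 0 at the 1% level, since p_value < 0.01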
How Good is the Regression?
The coefficient of determination, r², is a descriptive measure of the strength of
the regression relationship, a measure of how well the regression line fits the data.

Decomposition of a single deviation:

    (y - ȳ)     =     (y - ŷ)      +     (ŷ - ȳ)
    Total             Unexplained         Explained
    deviation         deviation           deviation
                      (error)             (regression)

Decomposition of the sums of squares:

    Σ (y - ȳ)² = Σ (y - ŷ)² + Σ (ŷ - ȳ)²
    SST        = SSE        + SSR

    r² = SSR / SST = 1 - SSE / SST

r² is the percentage of total variation explained by the regression.

[Figure: A data point shown with its total, explained, and unexplained deviations
relative to the mean of Y and the regression line.]
The Coefficient of Determination
Example 10-1:

    r² = SSR / SST = 64,527,736.8 / 66,855,898 = 0.96518

[Figure: Scatterplot of Dollars versus Miles with the fitted regression line.]
ANOVA Table and an F Test of the
Regression Model
Source of     Sum of     Degrees of
Variation     Squares    Freedom      Mean Square    F Ratio

Regression    SSR        1            MSR            MSR/MSE
Error         SSE        n - 2        MSE
Total         SST        n - 1        MST

Example 10-1:

Source of     Sum of         Degrees of
Variation     Squares        Freedom      Mean Square     F Ratio    p-Value

Regression    64,527,736.8    1           64,527,736.8    637.47     0.000
Error          2,328,161.2   23              101,224.4
Total         66,855,898.0   24
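A minimal sketch of the F test computed from the sums of squares (Example 10-1
numbers; scipy assumed available for the p-value):

    from scipy import stats

    ssr, sse, n, k = 64_527_736.8, 2_328_161.2, 25, 1   # k = 1 predictor

    msr = ssr / k
    mse = sse / (n - k - 1)
    f_stat = msr / mse                           # ≈ 637.5
    p_value = stats.f.sf(f_stat, k, n - k - 1)   # ≈ 0.000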
The k-Variable Multiple Regression Model

The population regression model of a dependent variable, Y, on a set of k
independent variables, X1, X2, ..., Xk is given by:

    Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

where β0 is the Y-intercept of the regression surface and each βi, i = 1, 2, ..., k,
is the slope of the regression surface (sometimes called the response surface)
with respect to Xi.

[Figure: A regression plane y = β0 + β1x1 + β2x2 + ε over the (x1, x2) plane.]

Model assumptions:
1. ε ~ N(0, σ²), independent of other errors.
2. The variables Xi are uncorrelated with the error term.
Simple and Multiple Least-
Squares Regression
[Figure: Left, a fitted line ŷ = b0 + b1x in the (x, y) plane; right, a fitted
plane ŷ = b0 + b1x1 + b2x2 in (x1, x2, y) space.]

In a simple regression model, the least-squares estimators minimize the sum of
squared errors from the estimated regression line. In a multiple regression
model, the least-squares estimators minimize the sum of squared errors from the
estimated regression plane.
The Estimated Regression Relationship

The estimated regression relationship:

    Ŷ = b0 + b1X1 + b2X2 + ... + bkXk

where Ŷ is the predicted value of Y, the value lying on the estimated regression
surface. The terms bi, for i = 0, 1, ..., k, are the least-squares estimates of
the population regression parameters βi.

The actual, observed value of Y is the predicted value plus an error:

    yj = b0 + b1x1j + b2x2j + ... + bkxkj + ej,   j = 1, ..., n.
Least-Squares Estimation:
The 2-Variable Normal Equations

Minimizing the sum of squared errors with respect to the estimated coefficients
b0, b1, and b2 yields the following normal equations, which can be solved for
b0, b1, and b2:

    Σ y   = n·b0    + b1·Σ x1   + b2·Σ x2
    Σ x1y = b0·Σ x1 + b1·Σ x1²  + b2·Σ x1x2
    Σ x2y = b0·Σ x2 + b1·Σ x1x2 + b2·Σ x2²
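These are three linear equations in (b0, b1, b2), so they can be solved directly
in matrix form. A generic Python sketch (the helper name is mine, for
illustration):

    import numpy as np

    def solve_two_variable_normal_equations(x1, x2, y):
        """Solve the 2-predictor normal equations for (b0, b1, b2)."""
        x1, x2, y = (np.asarray(a, float) for a in (x1, x2, y))
        n = y.size
        A = np.array([
            [n,         x1.sum(),       x2.sum()],
            [x1.sum(),  (x1**2).sum(),  (x1*x2).sum()],
            [x2.sum(),  (x1*x2).sum(),  (x2**2).sum()],
        ])
        rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])
        return np.linalg.solve(A, rhs)    # array [b0, b1, b2]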
Example 11-1: Aczel & Sounderpandian

• Alka-Seltzer in-store promotional campaign, in addition to the usual radio and
  TV commercials.
• Data recorded on:
  – X1: expenditure on radio and TV advertising
  – X2: spending on in-store displays
  – Y: sales for each week
• Goal: to study the impact of X1 and X2 on Y through a multiple linear regression:
  Y = β0 + β1X1 + β2X2 + ε
Example 11-1 (Aczel & Sounderpandian, Page 493)
Y     X1    X2    X1X2    X1²    X2²    X1Y    X2Y
72    12     5     60     144     25     864    360
76    11     8     88     121     64     836    608
78    15     6     90     225     36    1170    468
70    10     5     50     100     25     700    350
68    11     3     33     121      9     748    204
80    16     9    144     256     81    1280    720
82    14    12    168     196    144    1148    984
65     8     4     32      64     16     520    260
62     8     3     24      64      9     496    186
90    18    10    180     324    100    1620    900
---------------------------------------------------
743   123   65    869    1615    509    9382   5040

Normal equations:

     743 =  10·b0 +  123·b1 +  65·b2
    9382 = 123·b0 + 1615·b1 + 869·b2
    5040 =  65·b0 +  869·b1 + 509·b2

Solution:  b0 = 47.164942,  b1 = 1.5990404,  b2 = 1.1487479

Estimated regression equation:

    Ŷ = 47.164942 + 1.5990404 X1 + 1.1487479 X2
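The same coefficients can be recovered from the raw data with a least-squares
solve rather than by working through the normal equations by hand. A Python
sketch using the ten weekly observations from the table above:

    import numpy as np

    y  = np.array([72, 76, 78, 70, 68, 80, 82, 65, 62, 90], dtype=float)
    x1 = np.array([12, 11, 15, 10, 11, 16, 14,  8,  8, 18], dtype=float)
    x2 = np.array([ 5,  8,  6,  5,  3,  9, 12,  4,  3, 10], dtype=float)

    # Design matrix with an intercept column
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Least-squares solution of X b = y
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)   # ≈ [47.165, 1.599, 1.149]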
Decomposition of the Total Deviation in a Multiple
Regression Model


[Figure: A point in (x1, x2, y) space with its total, regression, and error
deviations from the mean of Y and the fitted regression plane.]

    Y - Ȳ : total deviation
    Ŷ - Ȳ : regression deviation
    Y - Ŷ : error deviation

Total Deviation = Regression Deviation + Error Deviation

    SST = SSR + SSE
The F Test of a Multiple Regression Model

A statistical test for the existence of a linear relationship between Y and any or
all of the independent variables X1, X2, ..., Xk:

    H0: β1 = β2 = ... = βk = 0
    H1: Not all the βi (i = 1, 2, ..., k) are equal to 0

Source of     Sum of     Degrees of
Variation     Squares    Freedom        Mean Square               F Ratio

Regression    SSR        k              MSR = SSR / k             MSR/MSE
Error         SSE        n - (k + 1)    MSE = SSE / (n - (k+1))
Total         SST        n - 1          MST = SST / (n - 1)
Analysis of Variance Table (Example 11-1)

[Figure: F distribution with 2 and 7 degrees of freedom, showing the critical
point F(0.01, 2, 7) = 9.55 and the test statistic F = 86.34 far in the right tail.]

The test statistic, F = 86.34, is greater than the critical point of F(2, 7) for
any common level of significance (p-value ≈ 0), so the null hypothesis is
rejected, and we may conclude that the dependent variable is related to one or
more of the independent variables.
How Good is the Regression ?

The mean square error is an unbiased estimator of the variance of the population
errors, σ², denoted by s²:

    MSE = SSE / (n - (k + 1)) = Σ (y - ŷ)² / (n - (k + 1))

Standard error of estimate:

    s = √MSE

The multiple coefficient of determination, R², measures the proportion of the
variation in the dependent variable that is explained by the combination of the
independent variables in the multiple regression model:

    R² = SSR / SST = 1 - SSE / SST
Decomposition of the Sum of Squares and
the Adjusted Coefficient of Determination

    R² = SSR / SST = 1 - SSE / SST

The adjusted multiple coefficient of determination, R̄², is the coefficient of
determination with the SSE and SST divided by their respective degrees of freedom:

    R̄² = 1 - [ SSE / (n - (k + 1)) ] / [ SST / (n - 1) ]

Example 11-1:  s = 1.911,  R-sq = 96.1%,  R-sq(adj) = 95.0%
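A quick numeric check of the adjusted R² formula using the Example 11-1 figures
(R² = 0.961, n = 10 observations, k = 2 predictors):

    n, k = 10, 2
    r_sq = 0.961                 # R-squared from Example 11-1

    sse_over_sst = 1 - r_sq      # SSE / SST
    adj_r_sq = 1 - sse_over_sst * (n - 1) / (n - (k + 1))
    print(round(adj_r_sq, 3))    # ≈ 0.950, i.e. R-sq(adj) = 95.0%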
ANOVA Table

Source of     Sum of     Degrees of
Variation     Squares    Freedom          Mean Square               F Ratio

Regression    SSR        k                MSR = SSR / k             F = MSR/MSE
Error         SSE        n - (k + 1)      MSE = SSE / (n - (k+1))
                         = n - k - 1
Total         SST        n - 1            MST = SST / (n - 1)

Related identities:

    R² = SSR / SST = 1 - SSE / SST

    F = [ R² / (1 - R²) ] · [ (n - (k + 1)) / k ]

    R̄² = 1 - [ SSE / (n - (k + 1)) ] / [ SST / (n - 1) ] = 1 - MSE / MST
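For Example 11-1, the F statistic can be recovered from R² alone with the
identity above (a small check in Python; the small difference from 86.34 is due
to rounding in R²):

    n, k = 10, 2
    r_sq = 0.961

    f_from_r_sq = (r_sq / (1 - r_sq)) * ((n - (k + 1)) / k)
    print(round(f_from_r_sq, 2))   # ≈ 86.2, close to the ANOVA table's F = 86.34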
References
• Statistics for Management: Srivastava, T.N. & Rego, S. (Tata McGraw-Hill)
• Research for Marketing Decisions: Green, P.E., Tull, D.S., & Albaum, G., 5th Ed.
  (Prentice Hall India)
• Complete Business Statistics: Aczel, A.D. & Sounderpandian, J., 6th Ed.
  (Tata McGraw-Hill)
• https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
