
Introduction to Statistical Methods

Prof. Gangaboraiah, PhD
BITS Pilani, Bangalore Campus

Session 11: Multiple Linear Regression Analysis
About me
• Dr. Gangaboraiah, PhD (Stats)
• Former Professor of Statistics, KIMS, Bangalore
• Work Experience
  • Kempegowda Institute of Medical Sciences, Bangalore (34 years)
  • Govt. Homeopathy Medical College, Bangalore (4 years)
  • SJC Institute of Technology, Chickballapur (13 years, Visiting Professor)
  • Manipal University, Bangalore Centre (since 2008, Visiting Professor): MS (Computer Science), MS (Computer Networks), Data Science
  • BITS (since 2013, Visiting Professor): MTech (Data Science)
  • WIPRO and Aricent (2019)
Agenda
Here's what you will learn in the entire unit:
1. Data Visualization: Why? What? How?
2. Measures of Central Tendency
3. Measures of Dispersion/Variation

Learning objectives of this unit
At the end of the session, the student should be able to:
• Define and interpret Covariance
• Define Correlation
• Identify types of correlation; solve problems on correlation
• Define Regression
• Identify types of regression; solve problems on regression
Session 11: Multiple Linear Regression Analysis
Multiple Linear Regression Analysis

• Many applications of regression analysis involve situations in which there is more than one regressor variable.
• A regression model that contains more than one regressor variable is called a multiple regression model.
Multiple Linear Regression Analysis

Suppose that there is a study on the prognosis (improvement) of patients at the time of diagnosis for a certain cancer for which there is not, as yet, an effective treatment.
Multiple Linear Regression Analysis
The physician might surmise that the length of survival for a patient would depend on the patient's age.
Conceptually this relationship could be explained as follows:
  Cancer prognosis varies with Age
Mathematically,
  Cancer prognosis = Age
Multiple Linear Regression Analysis

• Assign a suitable weighting factor based on its relative importance in predicting prognosis:
  Cancer prognosis = W1 Age
For the above equation to become useful, two more things are needed:
♦ some sort of an anchor point
♦ an error term
  Cancer prognosis = W0 + W1 Age + Error term

This is called a Simple Linear Regression (SLR) model.
Multiple Linear Regression Analysis
In general, denoting
♦ Cancer prognosis by Y
♦ Anchor point by a
♦ Weight factor by b
♦ Age by X
♦ Error term by ε

Cancer prognosis = W0 + W1 Age + Error term
is given by
  Y = a + bX + ε
Multiple Linear Regression Analysis
The physician might surmise that the length of survival for a patient would depend on at least four things:
● Patient's age
● The anatomic stage of the disease at the time of diagnosis
● The presence or absence of other diseases (co-morbidity)
● The degree of systemic symptoms such as weight loss
Multiple Linear Regression Analysis
Conceptually this relationship could be explained as follows:
  Cancer prognosis varies with Age, Stage, Co-morbidity and Symptoms
Mathematically,
  Cancer prognosis = Age + Stage + Co-morbidity + Symptoms
All four independent variables are not necessarily of equal importance.
Assign a suitable weighting factor to each based on its relative importance in predicting prognosis.
Multiple Linear Regression Analysis

Cancer prognosis = W1 Age + W2 Stage + W3 Co-morbidity + W4 Symptoms
For the above equation to become useful, two more things are needed:
● some sort of an anchor point
● an error term
Cancer prognosis = W0 + W1 Age + W2 Stage + W3 Co-morbidity + W4 Symptoms + Error term
Multiple Linear Regression Analysis
By denoting, say,
Y = Cancer prognosis
w0 = Anchor point
X1 = Age
X2 = Stage
X3 = Co-morbidity
X4 = Symptoms
e = Error term
the statistical model is
  Y = w0 + w1X1 + w2X2 + w3X3 + w4X4 + e
This model is called a multiple regression model.
Multiple Linear Regression Analysis
In general, a dependent (or response, or outcome) variable Y may be related to k independent (or explanatory, or regressor) variables. The model
  Y = β0 + β1X1 + β2X2 + β3X3 + … + βkXk + ε
is called a multiple linear regression model with k regressor variables. The parameters βj, j = 0, 1, 2, …, k, are called regression coefficients. This model describes a hyperplane in the k-dimensional space of the regressor variables {Xj}.
Multiple Linear Regression Analysis
The parameter βj represents the expected change in the response variable Y per unit change in Xj, when all the remaining regressors Xi (i ≠ j) are held constant.
Multiple Linear Regression Analysis
• The least squares normal equations are shown on the following slides.
• The solution to the normal equations gives the least squares estimators of the regression coefficients.
Multiple Linear Regression Analysis
Based on the sample data the model can be written as
  Y = b0 + b1X1 + b2X2 + b3X3 + … + bkXk + ε
The normal equations obtained from the least squares principle are (all sums over i = 1, …, n):
  Σ Yi = n b0 + b1 Σ X1i + b2 Σ X2i + … + bk Σ Xki
  Σ X1iYi = b0 Σ X1i + b1 Σ X1i² + b2 Σ X1iX2i + … + bk Σ X1iXki
  …
  Σ XkiYi = b0 Σ Xki + b1 Σ X1iXki + b2 Σ X2iXki + … + bk Σ Xki²
Multiple Linear Regression Analysis
When k = 2, the model becomes
  Y = β0 + β1X1 + β2X2 + ε
The normal equations obtained from the least squares principle are
  Σ Yi = n β0 + β1 Σ X1i + β2 Σ X2i
  Σ X1iYi = β0 Σ X1i + β1 Σ X1i² + β2 Σ X1iX2i
  Σ X2iYi = β0 Σ X2i + β1 Σ X1iX2i + β2 Σ X2i²
Multiple Linear Regression Analysis
For k = 2, based on the sample data the model can be written as
  Y = b0 + b1X1 + b2X2 + ε
The normal equations obtained from the least squares principle are
  Σ Yi = n b0 + b1 Σ X1i + b2 Σ X2i
  Σ X1iYi = b0 Σ X1i + b1 Σ X1i² + b2 Σ X1iX2i
  Σ X2iYi = b0 Σ X2i + b1 Σ X1iX2i + b2 Σ X2i²
Multiple Linear Regression Analysis
Solving these normal equations, the fitted regression equation is
  Ŷ = b̂0 + b̂1X1 + b̂2X2
For given values of X1 and X2, Y can be predicted.
Multiple Linear Regression Analysis
For example, suppose that the effective life of a cutting tool depends on the cutting speed and the tool angle. A possible multiple regression model could be
  Y = b0 + b1X1 + b2X2 + ε
where
  Y = tool life
  X1 = cutting speed
  X2 = tool angle


Multiple Linear Regression Analysis
  Pull strength = b0 + b1 Wire length + b2 Die height + ε
  Y = b0 + b1X1 + b2X2 + ε
The normal equations are
  25 b0 + 206 b1 + 8294 b2 = 725.82
  206 b0 + 2396 b1 + 77177 b2 = 8008.47
  8294 b0 + 77177 b1 + 3531848 b2 = 274816.71
Solving these normal equations we get
  b0 = 2.26379, b1 = 2.74427, b2 = 0.01253
The fitted regression equation is
  Y = 2.26379 + 2.74427 X1 + 0.01253 X2
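As a quick check, the three normal equations can be solved numerically. A minimal sketch in Python/NumPy, using the totals from the slide above (the variable names are illustrative):

```python
import numpy as np

# Normal equations from the pull-strength example, written as A b = c
A = np.array([[25,    206,    8294],
              [206,   2396,   77177],
              [8294,  77177,  3531848]], dtype=float)
c = np.array([725.82, 8008.47, 274816.71])

b = np.linalg.solve(A, c)   # least squares coefficients b0, b1, b2
print(b)                    # approx [2.264, 2.744, 0.0125]
```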
Multiple Linear Regression Analysis
  ŷ = a + b1x1 + b2x2 + b3x3 + … + bkxk
(Figure: the fitted regression surface of Y in the space of the regressors X1, X2, X3.)
Multiple Linear Regression Analysis
STATISTICAL DATA ANALYSIS
COMMON TYPES OF ANALYSIS
1. Compare Groups
   a. Compare Proportions (e.g., Chi-Square Test, χ²)
      H0: P1 = P2 = P3 = … = Pk
   b. Compare Means (e.g., Analysis of Variance)
      H0: µ1 = µ2 = µ3 = … = µk
Multiple Linear Regression Analysis
2. Examine Strength and Direction of Relationships
   a. Bivariate (e.g., Pearson Correlation, r)
      Between one variable and another:
        Y = a + b1x1
   b. Multivariate (e.g., Multiple Regression Analysis)
      Between one dependent variable and each of several independent variables, while holding all other independent variables constant:
        Y = a + b1x1 + b2x2 + b3x3 + … + bkxk
Multiple Linear Regression Analysis
What does regression analysis do?
It examines whether changes/differences in the values of one variable (the dependent variable Y) are linked to changes/differences in the values of one or more other variables (the independent variables X1, X2, etc.), while controlling for the changes in the values of all other Xs.
E.g., the relationship between salary and gender for people who have the same levels of education, work experience, position level, seniority, etc.
The DV (Y) must be metric.
The IVs (Xs) must be either metric or dummy variables.
Multiple Linear Regression Analysis
Central Questions Addressed:
• Is Y a function of X1, X2, etc.? How?
• Is there a relationship between Y and X1, X2, etc. (in each case, after controlling for the effects of all other Xs)? In what way?
• What is the relative impact of each X on Y, holding all other Xs constant (that is, all other Xs being equal)?
Multiple Linear Regression Analysis
More specifically,
• Do values of Y tend to increase/decrease as values of X1, X2, etc. increase/decrease?
If so,
• By how much? and
• How strong is the connection/relationship between the Xs and Y? That is, what % of the differences/variations in Y values (e.g., income) among study subjects can be explained by (or attributed to) differences in X values (e.g., years of education, years of experience, etc.)?

Multiple Linear Regression Analysis
NOTE: Once we can determine how values of Y change as a function of values of X1, X2, etc., we will also be able to predict/estimate the value of Y from specific values of X1, X2, etc.
  Y = a + b1x1 + b2x2 + b3x3 + … + bkxk + ε
Therefore, regression analysis, in a sense, is about ESTIMATING values of Y using information about the values of the Xs.
Estimation, by definition, involves error.
The objective? To minimize the error in estimation, or to compute estimates that are as close to the true/actual values as possible.
Multiple Linear Regression Analysis
QUESTION: What is the simplest way to obtain an estimate for some population characteristic (e.g., the number of credit cards per Indian household)?
ANSWER:
1. Select a representative sample from the population, and
2. Compute the mean for that sample (e.g., compute the average number of CCs for the sample households).
Regression analysis can be viewed as a technique that often significantly improves the accuracy of estimation relative to using the mean value.
So, suppose we were to estimate the number of credit cards for Indian households, based on information from a random sample of, say, n = 8 families.
Multiple Linear Regression Analysis
Estimating Number of Credit Cards

Family Number (i) | Actual # of Credit Cards (yi)
1 | 4
2 | 6
3 | 6
4 | 7
5 | 8
6 | 7
7 | 8
8 | 10
Σ yi = 56

Estimate? ŷ = ȳ = 56/8 = 7
QUESTION: Can we determine how much error in estimation we are committing by using ȳ = 7 as our estimate for each of these households?
Multiple Linear Regression Analysis
Estimating Number of Credit Cards

Family Number (i) | Actual # of Credit Cards (yi) | Estimate (ŷ = ȳ) | Error in Estimation
1 | 4 | 7 | ?
2 | 6 | 7 | ?
3 | 6 | 7 | ?
4 | 7 | 7 | ?
5 | 8 | 7 | ?
6 | 7 | 7 | ?
7 | 8 | 7 | ?
8 | 10 | 7 | ?
Σ yi = 56, ŷ = ȳ = 56/8 = 7
Multiple Linear Regression Analysis
Estimating Number of Credit Cards

Family Number (i) | Actual # of Credit Cards (yi) | Estimate (ŷ = ȳ) | Error (yi - ȳ)
1 | 4 | 7 | -3
2 | 6 | 7 | -1
3 | 6 | 7 | -1
4 | 7 | 7 | 0
5 | 8 | 7 | +1
6 | 7 | 7 | 0
7 | 8 | 7 | +1
8 | 10 | 7 | +3
Σ yi = 56, ŷ = ȳ = 56/8 = 7
Let's now see all this graphically.
Multiple Linear Regression Analysis
(Figure: the eight families' actual values plotted around the baseline estimate ŷ = ȳ = 7. Let's spread the dots away from each other to see things more clearly!)
Multiple Linear Regression Analysis
Graphic Representation
(Figure: each family's actual value vs. the baseline estimate ŷ = ȳ; the vertical gap between an actual value and the mean line is that family's estimation error.)
Can we determine the total estimation error for all 8 families?
Multiple Linear Regression Analysis
Family Number (i) | Actual # of Credit Cards (yi) | Estimate (ŷ = ȳ) | Error (yi - ȳ)
1 | 4 | 7 | -3
2 | 6 | 7 | -1
3 | 6 | 7 | -1
4 | 7 | 7 | 0
5 | 8 | 7 | +1
6 | 7 | 7 | 0
7 | 8 | 7 | +1
8 | 10 | 7 | +3
Σ yi = 56, ŷ = ȳ = 56/8 = 7
What would be the total estimation error for all 8 families combined? Σ(yi - ȳ) = 0. Solution?
Multiple Linear Regression Analysis
Estimating Number of Credit Cards

Family Number (i) | Actual # (yi) | Estimate (ŷ = ȳ) | Error (yi - ȳ) | Error Squared ((yi - ȳ)²)
1 | 4 | 7 | -3 | 9
2 | 6 | 7 | -1 | 1
3 | 6 | 7 | -1 | 1
4 | 7 | 7 | 0 | 0
5 | 8 | 7 | +1 | 1
6 | 7 | 7 | 0 | 0
7 | 8 | 7 | +1 | 1
8 | 10 | 7 | +3 | 9
Σ yi = 56, ŷ = ȳ = 7, Σ(yi - ȳ) = 0, Σ(yi - ȳ)² = 22 = SST (Sum of Squares Total)
Multiple Linear Regression Analysis
22 = SST = an index of the total (combined) amount of estimation error for all families (observations) in the sample when using the mean as the estimate.
• SST is also the sum of squared deviations from the mean. (Remember the formula for computing variance?)
• Objective in estimation? Minimize error, maximize precision.
• Can we cut down the amount of estimation error (SST)? How? Yes, we can, by using information about other variables suspected to be strong predictors of (strongly related to) the # of credit cards possessed by families (e.g., family size, family income, etc.).
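A minimal sketch of this baseline computation in Python/NumPy, using the data from the table above:

```python
import numpy as np

y = np.array([4, 6, 6, 7, 8, 7, 8, 10])  # actual # of credit cards
y_bar = y.mean()                          # baseline estimate = 7.0
sst = ((y - y_bar) ** 2).sum()            # Sum of Squares Total = 22.0
print(y_bar, sst)
```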


Multiple Linear Regression Analysis
Family Number (i) | Actual # of Credit Cards (y) | Family Size (x)
1 | 4 | 2
2 | 6 | 2
3 | 6 | 4
4 | 7 | 4
5 | 8 | 5
6 | 7 | 5
7 | 8 | 6
8 | 10 | 6
We now can attempt to estimate the # of credit cards from the information on family size, rather than from its own mean. Let's first see this graphically!
Multiple Linear Regression Analysis
(Figure: actual numbers of CCs plotted against family size, with the original (baseline) estimate ŷ = ȳ drawn as a horizontal line; e.g., family 1 has x = 2, y = 4.)
QUESTION: Does the mean (ȳ) appear to represent the closest estimate of the actual CC numbers for our sample families? That is, is the mean line the best line to represent the location of the estimates of # of CCs for these families?
Multiple Linear Regression Analysis
Generic equation for any straight line: Y = a + bx
(Figure: several candidate lines ŷ = a1 + b1x, ŷ = a2 + b2x, ŷ = a3 + b3x drawn through the scatter, together with the original (baseline) estimate ŷ = ȳ, i.e., ŷ = a + 0·x. The regression line, the line of best fit, is the new improved location for the CC estimates; see the next slide.)
Multiple Linear Regression Analysis
(Figure: the regression line ŷ = a + bx, the line of best fit, as the new improved location for the CC estimates, compared with the original (baseline) estimate ȳ; the vertical distance (y - ŷ) is the estimation error.)
The regression line will minimize Σ(y - ŷ)² = the total estimation error. But how do we know the values a and b in ŷ = a + bx (the regression line)?
Multiple Linear Regression Analysis
EQUATION FOR THE REGRESSION LINE (LINE OF BEST FIT): values of a and b for the regression line:
  b̂ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
  â = ȳ - b̂ x̄
  ŷ = â + b̂ x
Let's use the above formulas to compute the values of "a" and "b" for the regression line in our example.
We will need: ȳ, x̄, Σ(x - x̄)(y - ȳ), and Σ(x - x̄)².
Multiple Linear Regression Analysis
We need: ȳ, x̄, Σ(x - x̄)(y - ȳ), and Σ(x - x̄)²

Family (i) | y | x | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)²
1 | 4 | 2 | ? | ? | ? | ?
2 | 6 | 2 | ? | ? | ? | ?
3 | 6 | 4 | ? | ? | ? | ?
4 | 7 | 4 | ? | ? | ? | ?
5 | 8 | 5 | ? | ? | ? | ?
6 | 7 | 5 | ? | ? | ? | ?
7 | 8 | 6 | ? | ? | ? | ?
8 | 10 | 6 | ? | ? | ? | ?
ȳ = 56/8 = 7, x̄ = 34/8 = 4.25, Σ(x - x̄)(y - ȳ) = ?, Σ(x - x̄)² = ?
Multiple Linear Regression Analysis
We need: ȳ, x̄, Σ(x - x̄)(y - ȳ), and Σ(x - x̄)²

Family (i) | y | x | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)²
1 | 4 | 2 | -2.25 | -3 | 6.75 | 5.0625
2 | 6 | 2 | -2.25 | -1 | 2.25 | 5.0625
3 | 6 | 4 | -0.25 | -1 | 0.25 | 0.0625
4 | 7 | 4 | -0.25 | 0 | 0 | 0.0625
5 | 8 | 5 | 0.75 | 1 | 0.75 | 0.5625
6 | 7 | 5 | 0.75 | 0 | 0 | 0.5625
7 | 8 | 6 | 1.75 | 1 | 1.75 | 3.0625
8 | 10 | 6 | 1.75 | 3 | 5.25 | 3.0625
ȳ = 56/8 = 7, x̄ = 34/8 = 4.25, Σ(x - x̄)(y - ȳ) = 17, Σ(x - x̄)² = 17.5
Multiple Linear Regression Analysis
REGRESSION LINE (LINE OF BEST FIT):
  b̂ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 17 / 17.5 = 0.971
  â = ȳ - b̂ x̄ = 7 - 0.971(4.25) = 2.87
  ŷ = â + b̂ x = 2.87 + 0.971x
where 2.87 is the Y-intercept and 0.971 is the regression coefficient.
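The same slope and intercept can be reproduced in a few lines; a minimal sketch in Python/NumPy using the table's data:

```python
import numpy as np

x = np.array([2, 2, 4, 4, 5, 5, 6, 6])    # family size
y = np.array([4, 6, 6, 7, 8, 7, 8, 10])   # # of credit cards

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
print(a, b)   # approx 2.871, 0.971
```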
Multiple Linear Regression Analysis
(Figure: the regression line ŷ = 2.87 + 0.971x gives new, improved estimates compared with the original (baseline) estimate ȳ.)
Can we tell how much estimation error we have committed by using the new regression line? Yes: examine the differences between each household's actual # of CCs and its new regression estimate.
Multiple Linear Regression Analysis
ŷ = 2.87 + 0.97x

Family (i) | Actual # of Credit Cards (y) | Family Size (x) | Regression Estimate (ŷ) | Error/Residual (y - ŷ) | Errors Squared ((y - ŷ)²)
1 | 4 | 2 | ? | ? | ?
2 | 6 | 2 | ? | ? | ?
3 | 6 | 4 | ? | ? | ?
4 | 7 | 4 | ? | ? | ?
5 | 8 | 5 | ? | ? | ?
6 | 7 | 5 | ? | ? | ?
7 | 8 | 6 | ? | ? | ?
8 | 10 | 6 | ? | ? | ?
Σ(y - ŷ)² = ?
Multiple Linear Regression Analysis
ŷ = 2.87 + 0.97x; e.g., for family 1: ŷ = 2.87 + 0.97(2) = 4.81

Family (i) | y | x | ŷ | y - ŷ | (y - ŷ)²
1 | 4 | 2 | 4.81 | -0.81 | 0.66
2 | 6 | 2 | 4.81 | 1.19 | 1.42
3 | 6 | 4 | 6.76 | -0.76 | 0.58
4 | 7 | 4 | 6.76 | 0.24 | 0.06
5 | 8 | 5 | 7.73 | 0.27 | 0.07
6 | 7 | 5 | 7.73 | -0.73 | 0.53
7 | 8 | 6 | 8.70 | -0.70 | 0.49
8 | 10 | 6 | 8.70 | 1.30 | 1.69
Σ(y - ŷ)² = 5.486 = SSE = Sum of Squares Error (SS Residual)
Multiple Linear Regression Analysis
Total baseline error using the mean (SS Total): 22.0
New or remaining error (SS Error or SS Residual): 5.486 ≈ 5.5
QUESTION: How much of the original estimation error have we explained away (eliminated) by using the regression model (instead of the mean)?
  22 - 5.486 = 16.514 (SS Regression or SS Explained)
QUESTION: What % of the estimation error have we explained (eliminated) by using the regression model?
  R² = 16.514 / 22 = 0.751, or 75%. What is this called? It is the % of the differences in # of CCs among households that is explained by differences in their family size.
What does the remaining 25% represent? The percent of variation (differences) in the number of credit cards owned by families that can be accounted for by (a) all other potential predictors not included in the model, beyond family size, and (b) unexplainable random/chance variations.
Multiple Linear Regression Analysis
R² = SS Regression / SS Total = 16.5/22 = 75%
R² is a measure of our success regarding the accuracy of our estimation effort.
• R² = the % of estimation error that we have been able to explain away by using the regression model instead of the mean.
• R² indicates how much better we can predict Y from information about the Xs, rather than from its own mean.
• R² = the % of differences (variations) in Y values that is explained by (attributable to) differences in X values.
Note: When dealing with only two variables (a single X and Y), the Pearson correlation of Y with X1 (NOT controlling for any other variable) is
  r = √R² = √(16.514/22) = √0.75 = 0.866
Let's now examine all this graphically!
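A minimal sketch of the error decomposition in Python/NumPy, using the fitted line from the slides:

```python
import numpy as np

x = np.array([2, 2, 4, 4, 5, 5, 6, 6])
y = np.array([4, 6, 6, 7, 8, 7, 8, 10])

y_hat = 2.87 + 0.971 * x                 # regression estimates
sse = ((y - y_hat) ** 2).sum()           # remaining error, ~5.49
sst = ((y - y.mean()) ** 2).sum()        # baseline error, 22.0
r2 = (sst - sse) / sst                   # ~0.75
print(sse, sst, r2, np.sqrt(r2))         # r = sqrt(R^2), ~0.866
```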
Multiple Linear Regression Analysis
(Figure: the regression line ŷ = 2.87 + 0.97x, giving the new improved estimates, vs. the original (baseline) estimate ȳ. For a family such as F1, the total baseline error (y - ȳ) splits into a part explained by the regression model (ŷ - ȳ) and a new, unexplained error or residual (y - ŷ).)
Multiple Linear Regression Analysis
5.5 = SSE = the amount of estimation error for the 8 sample families when using simple regression (i.e., a regression model that includes only information about family size).
Can we reduce the amount of estimation error (SSE) to an even lower level and thus improve the estimation process? How?
Yes, by adding information on a second variable suspected to be strongly related to the # of credit cards (e.g., family income, X2).
Multiple Linear Regression Analysis

Family (i) | Actual # of Credit Cards (yi) | Family Size (x1) | Family Income (x2)
1 | 4 | 2 | 14
2 | 6 | 2 | 16
3 | 6 | 4 | 14
4 | 7 | 4 | 17
5 | 8 | 5 | 18
6 | 7 | 5 | 21
7 | 8 | 6 | 17
8 | 10 | 6 | 25
We now can attempt to estimate the # of CCs from our information on family size and family income! Our regression model will now be a linear plane, rather than a straight line.
Generic equation for a linear plane: ŷ = a + b1x1 + b2x2
Let's examine the regression plane for our example graphically.
(Figure: 3D plot of Y = # of credit cards against X1 = family size and X2 = family income, showing actual values and regression estimates on the fitted plane.)
ŷ = a + b1x1 + b2x2; formulas are available for computing the values of a, b1 and b2.
THE MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
  ŷ = 0.482 + 0.63x1 + 0.216x2
Let's now see how much error in estimation we are committing by using this multiple regression model.
Multiple Linear Regression Analysis
ŷ = 0.482 + 0.63x1 + 0.216x2

Family (i) | y | x1 | x2 (Income, $000) | Regression Estimate (ŷ) | Error/Residual (y - ŷ) | (y - ŷ)²
1 | 4 | 2 | 14 | ? | ? | ?
2 | 6 | 2 | 16 | ? | ? | ?
3 | 6 | 4 | 14 | ? | ? | ?
4 | 7 | 4 | 17 | ? | ? | ?
5 | 8 | 5 | 18 | ? | ? | ?
6 | 7 | 5 | 21 | ? | ? | ?
7 | 8 | 6 | 17 | ? | ? | ?
8 | 10 | 6 | 25 | ? | ? | ?
Σ(y - ŷ)² = ?
Multiple Linear Regression Analysis
ŷ = 0.482 + 0.63x1 + 0.216x2; e.g., for family 1: ŷ = 0.482 + 0.63(2) + 0.216(14) = 4.77

Family (i) | y | x1 | x2 (Income, $000) | ŷ | y - ŷ | (y - ŷ)²
1 | 4 | 2 | 14 | 4.77 | -0.77 | 0.59
2 | 6 | 2 | 16 | 5.20 | 0.80 | 0.64
3 | 6 | 4 | 14 | 6.03 | -0.03 | 0.00
4 | 7 | 4 | 17 | 6.68 | 0.32 | 0.10
5 | 8 | 5 | 18 | 7.53 | 0.47 | 0.22
6 | 7 | 5 | 21 | 8.18 | -1.18 | 1.39
7 | 8 | 6 | 17 | 7.95 | 0.05 | 0.00
8 | 10 | 6 | 25 | 9.67 | 0.33 | 0.11
SSE = Sum of Squares Error (Residual) = Σ(y - ŷ)² = 3.05
Unique (additional) contribution of X2 (family income) beyond X1: 5.5 - 3.05 = 2.45
Multiple Linear Regression Analysis
THE MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
  ŷ = 0.482 + 0.63x1 + 0.216x2
0.482 = the Y-intercept "a". (NOTE: only when all Xs can meaningfully take on the value zero does the intercept have a meaningful/direct/practical interpretation. Otherwise, it is simply an aid in increasing the accuracy of estimation.)
b1 and b2 = the regression coefficients:
• 0.63: among families of the same income, an increase in family size by one person would, on average, result in 0.63 more credit cards.
• 0.216: among families of the same size, an income increase of $1,000 results in an average increase of about 0.2 credit cards.
The "b"s represent the effect of each X on Y when all other Xs are controlled for/held constant/taken into account, i.e., after the impacts of all other variables are accounted for (remember the high blood pressure-hearing problem connection?).
Multiple Linear Regression Analysis
THE MULTIPLE REGRESSION MODEL FOR OUR EXAMPLE:
  ŷ = 0.482 + 0.63x1 + 0.216x2
SST = 22, SSE = 3.05. What is our new R²?
  SS Regression = 22 - 3.05 = 18.95
  R² = 18.95 / 22 = 0.861, or 86%: the percent of differences in households' number of CCs that is explained by differences in family size and family income.
The remaining 14% (3.05 / 22 = 0.14)? The percent of variation in the number of credit cards that can be accounted for by (a) all other relevant factors not included in the model, beyond family size and income, and (b) unexplainable random/chance variations.
Multiple Linear Regression Analysis
(Venn diagram: the total variation in Y = # of CCs is split into regions a, b, c, d: a is the part explained only by X1 = family size, b the part explained only by X2 = family income, c the part explained by both, and d the unexplained part.)
Total variation/error in Y = SS Total = a + b + c + d = 22
Simple regression of Y on X1: ŷ = 2.87 + 0.97X1, with SSR = a + c = 16.5
  r²(Y, X1) = (a + c)/(a + b + c + d) = 16.5/22 = 0.75
What do we call the square root of this? The Pearson/simple correlation of Y with X1 (not controlling for X2):
  r(Y, X1) = √0.75 = 0.867
Simple regression of Y on X2: ŷ = 0.063 + 0.398X2, with SSR = b + c = 15.12
  r²(Y, X2) = (b + c)/(a + b + c + d) = 15.12/22 = 0.687
The Pearson/simple correlation of Y with X2 (not controlling for X1):
  r(Y, X2) = √0.687 = 0.829
Multiple Linear Regression Analysis
  ŷ = 0.482 + 0.63x1 + 0.216x2
R², graphically? (NOTE: region c is explained by both X1 and X2.)
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
R² = SSR / SST = (a + b + c)/(a + b + c + d) = 18.95/22 = 86%
SSE = ? SSE = d = 22 - 18.95 = 3.05
Inference on Regression Coefficients
General regression model:
  Y = β0 + β1X + ε
1. β0 and β1 are parameters.
2. X is an independent variable.
3. The deviations ε are independent N(0, σ²).
Inference on Regression Coefficients
We will write the estimated regression line based on sample data as
  ŷ = b0 + b1x
The method of least squares chooses the values of b0 and b1 that minimize the sum of squared errors
  SSE = Σ (yi - ŷi)² = Σ (yi - b0 - b1xi)²
Inference on Regression Coefficients
Example: The weekly advertising expenditure (X) and weekly sales (Y) are presented in the following table.

Y | X
1250 | 41
1380 | 54
1425 | 63
1425 | 54
1450 | 48
1300 | 46
1400 | 62
1510 | 61
1575 | 64
1650 | 71
Inference on Regression Coefficients
Using the principle of least squares, we obtain the following estimates:
  b̂1 = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² = [n ΣXiYi - ΣXi ΣYi] / [n ΣXi² - (ΣXi)²], or b̂1 = r (Sy/Sx)
and
  b̂0 = Ȳ - b̂1X̄, so that Ŷ = b̂0 + b̂1X
From the previous table we have:
  n = 10, ΣX = 564, ΣX² = 32604, ΣY = 14365, ΣXY = 818755
The least squares estimates of the regression coefficients are:
  b̂1 = [n ΣXY - ΣX ΣY] / [n ΣX² - (ΣX)²] = [10(818755) - (564)(14365)] / [10(32604) - (564)²] = 10.8
  b̂0 = 1436.5 - 10.8(56.4) = 828
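A minimal sketch of the same computation in Python/NumPy, starting from the raw table rather than the precomputed sums:

```python
import numpy as np

x = np.array([41, 54, 63, 54, 48, 46, 62, 61, 64, 71])  # weekly ad expenditure
y = np.array([1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650])

n = len(x)
b1 = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)   # approx 828, 10.8
```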
The estimated regression function is:
  Ŷ = 828 + 10.8X
  Sales = 828 + 10.8 Expenditure
This means that if the weekly advertising expenditure is increased by Rs. 1, we would expect the weekly sales to increase by Rs. 10.8.
Fitted values for the sample data are obtained by substituting the X value into the estimated regression function. For example, if the advertising expenditure is Rs. 50, then the estimated sales is:
  Sales = 828 + 10.8(50) = 1368
This is called the point estimate (forecast) of the mean response (sales).
Example: weekly advertising expenditure

Y | X | Ŷ | Residual (e)
1250 | 41 | 1270.8 | -20.8
1380 | 54 | 1411.2 | -31.2
1425 | 63 | 1508.4 | -83.4
1425 | 54 | 1411.2 | 13.8
1450 | 48 | 1346.4 | 103.6
1300 | 46 | 1324.8 | -24.8
1400 | 62 | 1497.6 | -97.6
1510 | 61 | 1486.8 | 23.2
1575 | 64 | 1519.2 | 55.8
1650 | 71 | 1594.8 | 55.2
The variance σ² of the error terms εi in the regression model needs to be estimated for a variety of purposes.
It gives an indication of the variability of the probability distributions of Y.
It is needed for making inferences concerning the regression function and the prediction of Y.
To estimate σ we work with the variance and take the square root to obtain the standard deviation.
For simple linear regression the estimate of σ² is the average squared residual:
  s²y.x = (1/(n - 2)) Σ ei² = (1/(n - 2)) Σ (Yi - Ŷi)²
To estimate σ, use sy.x = √(s²y.x).
sy.x estimates the standard deviation σ of the error term ε in the statistical model for simple linear regression.
Y | X | Ŷ | Residual (e) | e²
1250 | 41 | 1270.8 | -20.8 | 432.64
1380 | 54 | 1411.2 | -31.2 | 973.44
1425 | 63 | 1508.4 | -83.4 | 6955.56
1425 | 54 | 1411.2 | 13.8 | 190.44
1450 | 48 | 1346.4 | 103.6 | 10732.96
1300 | 46 | 1324.8 | -24.8 | 615.04
1400 | 62 | 1497.6 | -97.6 | 9525.76
1510 | 61 | 1486.8 | 23.2 | 538.24
1575 | 64 | 1519.2 | 55.8 | 3113.64
1650 | 71 | 1594.8 | 55.2 | 3047.04
Ŷ = 828 + 10.8X; total Σ e² = 36124.76
sy.x = √(36124.76 / 8) = 67.198
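A minimal sketch of the residual standard deviation computation in Python/NumPy:

```python
import numpy as np

x = np.array([41, 54, 63, 54, 48, 46, 62, 61, 64, 71])
y = np.array([1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650])

y_hat = 828 + 10.8 * x
sse = ((y - y_hat) ** 2).sum()        # 36124.76
s_yx = np.sqrt(sse / (len(x) - 2))    # 67.198, estimates sigma
print(sse, s_yx)
```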
Confidence Intervals and Significance Tests
In our previous lectures we presented confidence intervals and significance tests for means and differences in means. In each case, inference rested on the standard error s of the estimates and on t or z distributions.
Inference for the slope and intercept in linear regression is similar in principle, although the recipes are more complicated.
All confidence intervals, for example, have the form
  estimate ± t* SE(estimate)
where t* is a critical value of a t distribution.
Confidence Intervals and Significance Tests
Confidence intervals and tests for the slope and intercept are based on the sampling distributions of the estimates b1 and b0.
Here are the facts:
If the simple linear regression model is true, each of b0 and b1 has a Normal distribution.
The mean of b0 is β0 and the mean of b1 is β1. That is, the intercept and slope of the fitted line are unbiased estimators of the intercept and slope of the population regression line.
Confidence Intervals and Significance Tests
The standard deviations of b0 and b1 are multiples of the model standard deviation σ:
  SE(b1) = S(b1) = s / √(Σ(Xi - X̄)²)
  SE(b0) = S(b0) = s √(1/n + X̄² / Σ(Xi - X̄)²)
Confidence Intervals and Significance Tests
Let us return to the weekly advertising expenditure and weekly sales example. Management is interested in testing whether or not there is a linear association between advertising expenditure and weekly sales, using the regression model. Use α = 0.05.
Hypotheses: H0: β1 = 0 vs. Ha: β1 ≠ 0
Test statistic: t = b1 / S(b1), compared with t(α/2; n-2)
Decision rule: Reject H0 if t ≥ t(.025; 8) = 2.306 or t ≤ -t(.025; 8) = -2.306
Confidence Intervals and Significance Tests
Test statistic:
  S(b1) = sy.x / √(Σ(x - x̄)²) = 67.2 / √794.4 = 2.38
  t = b1 / S(b1) = 10.8 / 2.38 = 4.5
Conclusion: Since t = 4.5 > 2.306, we reject H0.
There is a linear association between advertising expenditure and weekly sales.
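A minimal sketch of the slope test in Python/NumPy, with the values taken from the slides:

```python
import numpy as np

x = np.array([41, 54, 63, 54, 48, 46, 62, 61, 64, 71])
b1, s_yx = 10.8, 67.2

se_b1 = s_yx / np.sqrt(((x - x.mean()) ** 2).sum())  # 67.2/sqrt(794.4), ~2.38
t = b1 / se_b1                                       # ~4.5 > 2.306, reject H0
print(se_b1, t)
```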
Confidence Intervals and Significance Tests
  b1 ± t(α/2; n-2) S(b1)
Now that the test has shown that there is a linear association between advertising expenditure and weekly sales, management wishes an estimate of β1 with a 95% confidence coefficient.
Confidence Intervals and Significance Tests
For a 95 percent confidence coefficient, we require t(.025; 8). From Table B in Appendix III, we find t(.025; 8) = 2.306.
The 95% confidence interval is:
  b1 ± t(α/2; n-2) S(b1)
  = 10.8 ± 2.306(2.38)
  = 10.8 ± 5.49 = (5.31, 16.3)
Analysis of Variance Approach to Regression Analysis
Analysis of variance is the term for statistical analyses that break down the variation in data into separate pieces that correspond to different sources of variation.
It is based on the partitioning of the sums of squares and degrees of freedom associated with the response variable.
In the regression setting, the observed variation in the responses (yi) comes from two sources.
Analysis of Variance Approach to Regression Analysis
The breakdown of the total sum of squares and associated degrees of freedom is displayed in a table called the analysis of variance (ANOVA) table:

Source of variation | df | SS | MSS | F-ratio
Regression | 1 | SSR | MSR = SSR/1 | MSR/MSE
Error | n-2 | SSE | MSE = SSE/(n-2) |
Total | n-1 | SST | |
Analysis of Variance Approach to Regression Analysis
For the advertising example, the ANOVA table is:

Source of variation | df | SS | MSS | F-ratio
Regression | 1 | 92427.74 | 92427.74 | 20.47
Error | 8 | 36124.76 | 4515.6 |
Total | 9 | 128552.5 | |
Analysis of Variance Approach to Regression Analysis
F test for β1 = 0 versus β1 ≠ 0
Equivalence of the F test and the t test: for a given α level, the F test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-sided t test (F = t² up to rounding: 20.47 ≈ 4.5²).
Thus, at a given α level, we can use either the t test or the F test for testing β1 = 0 versus β1 ≠ 0.
The t test is more flexible, since it can also be used for one-sided tests.
Multiple Regression Analysis
Population model:
  Y = β0 + β1X1 + β2X2 + … + βkXk + εi
Sample model:
  Y = b0 + b1X1 + b2X2 + … + bkXk + ei
Multiple Regression Analysis
To test whether the explanatory variables collectively have an effect on y, we test
  H0: β1 = β2 = … = βk = 0 (i.e., y is independent of all the explanatory variables)
  Ha: at least one βi ≠ 0 (at least one explanatory variable has an effect on y, controlling for the others in the model)
This is equivalent to testing
  H0: population multiple correlation = 0 (or population R² = 0)
  vs. Ha: population multiple correlation > 0
Multiple Regression Analysis
Test statistic (with k explanatory variables):
  F = (R²/k) / [(1 - R²)/(n - (k + 1))]
  df1 = k (the number of explanatory variables in the model)
  df2 = n - (k + 1) (sample size minus the number of model parameters)
When H0 is true, the F values follow the F distribution (R. A. Fisher).
A larger R² gives a larger F test statistic, hence more evidence against the null hypothesis.
Since a larger F gives stronger evidence against the null, the P-value is the right-tail probability above the observed value.
Example with two predictor variables
H0: β1 = β2 = 0 (i.e., y is independent of x1 and x2)
H1: β1 ≠ 0 or β2 ≠ 0 or both
Test statistic:
  F = (R²/k) / [(1 - R²)/(n - (k + 1))] = (0.861/2) / [(1 - 0.861)/(8 - 3)] = 15.486
For df1 = 2, df2 = 5, Fobs = 15.486, P < 0.01.
There is very strong evidence that at least one of the explanatory variables is associated with the # of credit cards.
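A minimal sketch of the overall F test in Python, where the SciPy call computes the right-tail probability (R², k, n are taken from the example):

```python
from scipy import stats

r2, k, n = 0.861, 2, 8
F = (r2 / k) / ((1 - r2) / (n - (k + 1)))  # ~15.49
p = stats.f.sf(F, k, n - (k + 1))          # right-tail P-value, ~0.007
print(F, p)
```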
Inferences for individual regression coefficients (Do we need all predictors in the model?)
To test the partial effect of xi, controlling for the other explanatory variables in the model, we test
  H0: βi = 0 vs. H1: βi ≠ 0
using the test statistic t = (bi - 0)/SE(bi) with df = n - (k + 1),
Inferences for individual regression coefficients
which is df2 from the F test (and appears in the df column of the ANOVA table, in the Residual row).
The CI for βi has the form bi ± tα/2 SE(bi), with the t-score from the t table also having df = n - (k + 1), for the desired confidence level.
Software provides estimates, standard errors, t test statistics, and P-values for the tests (two-sided by default).
Inference on correlation coefficients
Significance test: H0: ρ = 0 vs. H1: ρ ≠ 0
To test whether the relation is merely apparent, and/or might have arisen by chance, the t test is applied:
  t = r √((n - 2)/(1 - r²))
which follows a t-distribution with n - 2 degrees of freedom.
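A minimal sketch of this correlation test in Python/NumPy, using r = 0.866 and n = 8 from the credit-card example:

```python
import numpy as np

r, n = 0.866, 8
t = r * np.sqrt((n - 2) / (1 - r ** 2))  # follows t with n-2 = 6 df under H0
print(t)                                  # ~4.24
```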
Logistic Regression Analysis

When and Why

(Data: sample records from a study of suicide attempts)

Sl No | Outcome of suicide | Age (yrs) | Sex | Marital status | Cause of suicide | Religion | SES | Occupation | Method of attempt | Time of event | Time interval between attempt to suicide and bringing to hospital
1 | Died | 19 | Female | Married | Dowry death | Hindu | Middle | Housewife | Hanging | Morning | 30 minutes
2 | Survived | 20 | Male | Unmarried | Failure in studies | Hindu | Middle | Student | Poison | | 30 minutes
3 | Died | 21 | Male | Unmarried | Depression | Hindu | Lower | Painting/Coolie | Electrical | Night | Time of death unknown; shifted to hospital for postmortem
4 | Died | 21 | Male | Unmarried | Depression | Hindu | Middle | Manager in cane juice centre | Hanging | Morning | 25 minutes
5 | Died | 17 | Male | Unmarried | Failure in studies | Hindu | Middle | Student | Hanging | Night | Time of death unknown; shifted to hospital for postmortem
6 | Died | 20 | Male | Unmarried | Problem at work place | Hindu | Lower | Coolie | Electrical | Evening | Time of death unknown; shifted to hospital for postmortem
7 | Survived | 19 | Female | Unmarried | Pain abdomen | Hindu | Upper Middle | Student | Poison | | 70 minutes
8 | Survived | 16 | Female | Unmarried | Depression | Jain | Middle | Student | Hanging | | 40 minutes
9 | Survived | 20 | Male | Unmarried | Depression | Hindu | Middle | Student | Fall from height | | 35 minutes
10 | Died | 20 | Male | Unmarried | Love failure | Hindu | Middle | Student | Hanging | Morning | Time of death unknown; shifted to hospital for postmortem
Logistic Regression Analysis
Introduction
In linear regression models it is assumed that the dependent variable Y is quantitative (continuous/discrete) and normally distributed.
But in many instances the dependent variable will not be quantitative; instead, it may be categorical.
If the dependent variable is categorical, it violates the assumptions of normal linear regression.
Logistic Regression Analysis
In many regression settings, the Y variable is binary (0, 1).
A few examples:
• A consumer chooses a brand (1) or not (0);
• A quality defect occurs (1) or not (0);
• A person is hired (1) or not (0);
• Evacuate home during a hurricane (1) or not (0);
• Other examples???
Logistic Regression Analysis
(Figure: scatterplot with Y = (0, 1): Y = hired/not hired, X = experience; the points lie along the two horizontal levels Y = 0 and Y = 1.)
Logistic Regression Analysis
The Linear Probability Model (LPM)
If we estimate the slope using OLS regression:
  Hired = α + β Income + e
The result is called a "Linear Probability Model":
• The predicted values are probabilities that Y equals 1;
• The equation is linear: the slope is constant.
Logistic Regression Analysis
Picture of the LPM
(Figure: scatterplot with Y = (0, 1): Y = hired/not hired, X = experience. The LPM regression line has a constant slope coefficient; points on the regression line represent predicted probabilities of Y for each value of X.)
Logistic Regression Analysis
An Example: Loan Approvals
Data:
Dependent variable:
  Loaned = 1 if loan approved, 0 if not approved by Bank Z
Independent variables:
  ROA = net income as % of total assets of applicant;
  Debt = debt as % of total assets of applicant;
  Officer = 1 if loan handled by loan officer A, 0 if handled by officer B
Logistic Regression Analysis
Weaknesses of the Linear Probability Model (LPM)
• The predicted probabilities can be greater than 1 or less than 0.
  • Probabilities, by definition, have max = 1 and min = 0;
  • This is not a big issue if they are very close to 0 and 1.
• The error terms vary based on the size of the X variable ("heteroskedastic");
  • There may be models that have lower variance, and are thus more "efficient".
• The errors are not normally distributed because Y takes on only two values.
  • This creates problems for inference;
  • More of an issue for statistical theorists.
Logistic Regression Analysis
Picture of the LPM fused with an S-type curve
(Figure: the same scatterplot with Y = (0, 1): Y = hired/not hired, X = experience, with the straight LPM regression line overlaid by an S-shaped curve; points on the line/curve represent predicted probabilities of Y for each value of X.)
Logistic Regression Analysis
Model development
The model that describes the S-type curve is as follows.
Let 'p' be the probability that the event Y occurs, i.e., P(Y = 1).
Let '1 - p' be the probability that the event Y does not occur, i.e., P(Y = 0).
Logistic Regression Analysis
Model development
The estimated probability P(Y) with one predictor variable is given by
  P(Y) = 1 / (1 + e^-(β0 + β1X1 + ε))
and
  1 - P(Y) = 1 - 1/(1 + e^-(β0 + β1X1 + ε)) = 1 / (1 + e^(β0 + β1X1 + ε))
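A minimal sketch of this S-type curve in Python/NumPy (the coefficients b0 = -4 and b1 = 0.8 are purely illustrative, not from the slides):

```python
import numpy as np

def logistic(x, b0=-4.0, b1=0.8):
    # P(Y) = 1 / (1 + exp(-(b0 + b1*x))), the S-type curve
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(0, 10, 6)
print(logistic(x))  # probabilities rise from near 0 toward 1 along an S-curve
```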
Logistic Regression Analysis
Model development
The ratio P(Y) / (1 - P(Y)) is called the odds (a ratio of two probabilities), and is given by
  P(Y) / (1 - P(Y)) = e^(β0 + β1X1 + ε)
Logistic Regression Analysis
Relationship between Odds and Probability
  Odds(Event) = Probability(Event) / (1 - Probability(Event))
  Probability(Event) = Odds(Event) / (1 + Odds(Event))
Logistic Regression Analysis
The Odds Ratio
Definition of odds ratio: the ratio of two odds estimates.
So, if Pr(response | trt) = 0.40 and Pr(response | placebo) = 0.20, then:
  Odds(response | trt group) = 0.40 / (1 - 0.40) = 0.667
  Odds(response | placebo group) = 0.20 / (1 - 0.20) = 0.25
  OR(trt vs. placebo) = 0.667 / 0.25 = 2.67
Logistic Regression Analysis
Model development
The logarithm of P(Y) / (1 - P(Y)) is called the logit.
Hence,
  Ln[P(Y) / (1 - P(Y))] = β0 + β1X1 + ε
is called the logistic regression model.
Logistic Regression Analysis
The solution of the logistic regression model can be obtained by the maximum likelihood method.
However, direct estimation may be difficult because of the complexity of the function, and the model is solved iteratively using computers.
Logistic Regression Analysis
Confusion matrix

Sex | Outcome: Survived | Outcome: Died | Total
Male | 6 | 41 | 47
Female | 14 | 52 | 66
Total | 20 | 93 | 113
Logistic Regression Analysis
With the standard 2×2 layout (a = true positives, b = false positives, c = false negatives, d = true negatives, n = a + b + c + d):
  Sensitivity = a/(a + c) × 100
  Specificity = d/(b + d) × 100
  Positive predictive value = a/(a + b) × 100
  Negative predictive value = d/(c + d) × 100
  Accuracy = (a + d)/n × 100
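A minimal sketch of these measures as a Python function; the mapping of the suicide table's cells onto a, b, c, d below is illustrative, since it depends on which outcome is treated as "positive":

```python
def metrics(a, b, c, d):
    # a = true positives, b = false positives,
    # c = false negatives, d = true negatives
    n = a + b + c + d
    return {
        "sensitivity": 100 * a / (a + c),
        "specificity": 100 * d / (b + d),
        "ppv":         100 * a / (a + b),
        "npv":         100 * d / (c + d),
        "accuracy":    100 * (a + d) / n,
    }

print(metrics(a=6, b=41, c=14, d=52))  # illustrative cell assignment
```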
