Session 11: Multiple Linear Regression Analysis
BITS Pilani, Bangalore Campus
About Me
Dr. Gangaboraiah, PhD (Statistics)
Former Professor of Statistics, KIMS, Bangalore
Work Experience
• Kempegowda Institute of Medical Sciences, Bangalore (34 years)
• Govt. Homeopathy Medical College, Bangalore (4 years)
• SJC Institute of Technology, Chickballapur (13 years, Visiting Professor)
• Manipal University, Bangalore Centre (since 2008, Visiting Professor): MS (Computer Science), MS (Computer Networks), Data Science
• BITS (since 2013, Visiting Professor): MTech (Data Science)
• WIPRO and Aricent (2019)
Prof.Gangaboraiah PhD (Stats) | Slide 3 Former Professor of Statistics | KIMS, B’lore
Agenda
Here's what you will learn in this unit:
1. Data Visualization: Why? What? How?
• Define Correlation
• Define Regression
Normal equations for the k-predictor model (all sums run over i = 1 to n):

ΣYi = nb0 + b1ΣX1i + b2ΣX2i + … + bkΣXki
…
ΣXkiYi = b0ΣXki + b1ΣX1iXki + b2ΣX2iXki + … + bkΣXki²

For two predictors, in population (β) notation:

ΣYi = nβ0 + β1ΣX1i + β2ΣX2i
ΣX1iYi = β0ΣX1i + β1ΣX1i² + β2ΣX1iX2i
ΣX2iYi = β0ΣX2i + β1ΣX1iX2i + β2ΣX2i²

and in sample (b) notation:

ΣYi = nb0 + b1ΣX1i + b2ΣX2i
ΣX1iYi = b0ΣX1i + b1ΣX1i² + b2ΣX1iX2i
ΣX2iYi = b0ΣX2i + b1ΣX1iX2i + b2ΣX2i²

Solving these gives the fitted equation:

Ŷ = b̂0 + b̂1X1 + b̂2X2
For given values of X1 and X2, Y can be predicted.
[Figure: fitted regression surface ŷ over the predictor axes X1, X2, X3]
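The normal equations above can be solved mechanically. Below is a minimal pure-Python sketch for the two-predictor case; the dataset is made up for illustration and is not from the slides:

```python
def solve_normal_equations(y, x1, x2):
    """Fit Y = b0 + b1*X1 + b2*X2 by building and solving the
    three normal equations shown above."""
    n = len(y)
    sx1, sx2, sy = sum(x1), sum(x2), sum(y)
    sx1x1 = sum(a * a for a in x1)
    sx2x2 = sum(a * a for a in x2)
    sx1x2 = sum(a * b for a, b in zip(x1, x2))
    sx1y = sum(a * b for a, b in zip(x1, y))
    sx2y = sum(a * b for a, b in zip(x2, y))
    # Coefficient matrix (left-hand sides) and right-hand sides
    A = [[n,   sx1,   sx2],
         [sx1, sx1x1, sx1x2],
         [sx2, sx1x2, sx2x2]]
    r = [sy, sx1y, sx2y]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(A[i][col]))
        A[col], A[piv] = A[piv], A[col]
        r[col], r[piv] = r[piv], r[col]
        for i in range(col + 1, 3):
            f = A[i][col] / A[col][col]
            A[i] = [aij - f * acj for aij, acj in zip(A[i], A[col])]
            r[i] -= f * r[col]
    b = [0.0] * 3
    for i in (2, 1, 0):   # back-substitution
        b[i] = (r[i] - sum(A[i][j] * b[j] for j in range(i + 1, 3))) / A[i][i]
    return b  # [b0, b1, b2]
```

If the data are generated exactly from a plane, the solver recovers the coefficients, which is a quick sanity check on the equations.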
1. Compare Groups
Compare Proportions (e.g., Chi-Square test, χ²):
H0: P1 = P2 = P3 = … = Pk
2. Examine Relationships
a. Bivariate (e.g., Simple Regression Analysis):
Y = a + b1x1
b. Multivariate (e.g., Multiple Regression Analysis):
Between one dependent variable and each of several independent variables, while holding all other independent variables constant:
Y = a + b1x1 + b2x2 + b3x3 + … + bkxk
Multiple Linear Regression Analysis
What does regression analysis do?
Examines whether changes/differences in values of one
variable (dependent variable Y) are linked to
changes/differences in values of one or more other
variables (independent variables X1, X2, etc.), while
controlling for the changes in values of all other Xs.
E.g., Relationship between salary and gender for people who have the same levels of education, work
experience, position level, seniority, etc.
Y = a + b1x1 + b2x2 + b3x3 + … + bkxk + ε
Therefore, regression analysis, in a sense, is about ESTIMATING values of Y, using information about values of Xs.
Estimation, by definition, involves?
The objective?
To minimize error in estimation.
Or, to compute estimates that are as close to the true/actual
values as possible.
Σyᵢ = 56, so ȳ = 56/8 = 7; baseline estimate: ŷ = ȳ = 7
Estimating Number of Credit Cards*
Family   Actual # of    Estimate for #    Error in
Number   Credit Cards   of Credit Cards   Estimation
i        yᵢ             ŷ = ȳ             yᵢ − ȳ
1        4              7                 -3
2        6              7                 -1
3        6              7                 -1
4        7              7                  0
5        8              7                 +1
6        7              7                  0
7        8              7                 +1
8        10             7                 +3
Σyᵢ = 56, ȳ = 56/8 = 7
Let's now see all this graphically.
Multiple Linear Regression Analysis
10
F8
9
F5
8 F7
F6 ^
7
F4 Y Y Estimat
6 F2, F3 e
5
4 F1
3
2
1
Let’s spread the dots away from each other to
0
see things more clearly!
Prof.Gangaboraiah PhD (Stats) | Slide 41 Former Professor of Statistics | KIMS, B’lore
Graphic Representation
[Figure: scatterplot of families F1–F8 with the baseline estimate Ŷ = Ȳ; the vertical distance from each actual point to the line is that family's estimation error]
Can we determine the total estimation error for all 8 families?
Family   Actual # of    Estimate for #    Error in
Number   Credit Cards   of Credit Cards   Estimation
i        yᵢ             ŷ = ȳ             yᵢ − ȳ
1        4              7                 -3
2        6              7                 -1
3        6              7                 -1
4        7              7                  0
5        8              7                 +1
6        7              7                  0
7        8              7                 +1
8        10             7                 +3
Σyᵢ = 56, ȳ = 7; Σ(yᵢ − ȳ) = 0
What would be the total estimation error for all 8 families combined? The positive and negative errors cancel out to zero. Solution?
Estimating Number of Credit Cards*
Family   Actual # of    Estimate for #    Error in      Errors
Number   Credit Cards   of Credit Cards   Estimation    Squared
i        yᵢ             ŷ = ȳ             yᵢ − ȳ        (yᵢ − ȳ)²
1        4              7                 -3            9
2        6              7                 -1            1
3        6              7                 -1            1
4        7              7                  0            0
5        8              7                 +1            1
6        7              7                  0            0
7        8              7                 +1            1
8        10             7                 +3            9
Σyᵢ = 56, ȳ = 7; Σ(yᵢ − ȳ) = 0; Σ(yᵢ − ȳ)² = 22
SST = Sum of Squares Total
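The arithmetic in the table can be verified in a few lines of Python, using the credit-card counts from the slides:

```python
# Actual number of credit cards for the 8 families (from the slides)
y = [4, 6, 6, 7, 8, 7, 8, 10]
y_bar = sum(y) / len(y)            # baseline estimate: the mean, 7.0
errors = [yi - y_bar for yi in y]  # raw errors sum to zero...
sst = sum(e ** 2 for e in errors)  # ...so square them: Sum of Squares Total
```

Squaring is what makes the total error a usable quantity; the raw errors always cancel to zero around the mean.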
• Objective in Estimation?
Minimize error, maximize precision.
b̂ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
ŷ = â + b̂x, where â = ȳ − b̂x̄
Let's use the above formulas to compute the values of "a" and "b" for the regression line in our example.
ȳ = 56/8 = 7; x̄ = 34/8 = 4.25; Σ(x − x̄)(y − ȳ) = ?; Σ(x − x̄)² = ?
We need: ȳ, x̄, Σ(x − x̄)(y − ȳ), and Σ(x − x̄)²
Family   Actual # of    Family
Number   Credit Cards   Size
i        y              x      x − x̄    y − ȳ    (x − x̄)(y − ȳ)    (x − x̄)²
1        4              2      -2.25    -3        6.75              5.0625
2        6              2      -2.25    -1        2.25              5.0625
3        6              4      -0.25    -1        0.25              0.0625
4        7              4      -0.25     0        0                 0.0625
5        8              5       0.75     1        0.75              0.5625
6        7              5       0.75     0        0                 0.5625
7        8              6       1.75     1        1.75              3.0625
8        10             6       1.75     3        5.25              3.0625
ȳ = 56/8 = 7; x̄ = 34/8 = 4.25; Σ(x − x̄)(y − ȳ) = ?; Σ(x − x̄)² = ?
REGRESSION LINE (LINE OF BEST FIT):
b̂ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 17 / 17.5 = 0.971
â = ȳ − b̂x̄ = 7 − 0.971(4.25) = 2.87
So â = 2.87 (Y-intercept) and b̂ = 0.971 (regression coefficient), giving:
ŷ = 2.87 + 0.971x
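The slope and intercept above follow directly from the table's columns; a short Python check, using the family-size and credit-card data from the slides:

```python
x = [2, 2, 4, 4, 5, 5, 6, 6]      # family size
y = [4, 6, 6, 7, 8, 7, 8, 10]     # number of credit cards
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # = 17
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # = 17.5
b_hat = sxy / sxx                  # slope, approx. 0.971
a_hat = y_bar - b_hat * x_bar      # intercept, approx. 2.87
```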
ŷ = 2.87 + 0.971x
[Figure: scatterplot of families F1–F8 against family size X, showing the fitted line ŷ = 2.87 + 0.971x (new, improved estimates) and the horizontal line ŷ = ȳ = 7 (original/baseline estimate)]
Can we tell how much estimation error we have committed by using the new regression line?
Yes: examine the differences between each household's actual # of CCs and their new/regression estimates.
Next, compute each family's regression estimate ŷ = 2.87 + 0.97x, its error (residual) y − ŷ, and the squared residual (y − ŷ)², then total the squared residuals Σ(y − ŷ)²:
ŷ = 2.87 + 0.97x; e.g., for family 1: ŷ = 2.87 + 0.97(2) = 4.81
Family   Actual # of    Family   Regression   Error        Errors
Number   Credit Cards   Size     Estimate     (Residual)   Squared
i        y              x        ŷ            y − ŷ        (y − ŷ)²
1        4              2        4.81         -0.81        0.66
2        6              2        4.81          1.19        1.42
3        6              4        6.76         -0.76        0.58
4        7              4        6.76          0.24        0.06
5        8              5        7.73          0.27        0.07
6        7              5        7.73         -0.73        0.53
7        8              6        8.70         -0.70        0.49
8        10             6        8.70          1.30        1.69
Σ(y − ŷ)² = 5.486 (the table entries are rounded)
SSE = Sum of Squares Error (SS Residual)
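The SSE figure can be reproduced from the fitted line and the same data; using the unrounded slope 17/17.5 gives the 5.486 reported on the slide:

```python
x = [2, 2, 4, 4, 5, 5, 6, 6]      # family size
y = [4, 6, 6, 7, 8, 7, 8, 10]     # number of credit cards
b_hat = 17 / 17.5                  # slope from the previous slide
a_hat = 7 - b_hat * 4.25           # intercept from the previous slide
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)   # approx. 5.486
```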
Total baseline error using the mean (SS Total): 22.0
New or remaining error (SS Error or SS Residual): 5.486 ≈ 5.5
QUESTION: How much of the original estimation error have we explained away (eliminated) by using the regression model (instead of the mean)? (Total variation in Y = 22.)
[Figure: for family F1, the vertical gap y − ŷ between the point and the fitted line is the new error (unexplained/residual)]
5.5 = SSE = the amount of estimation error for the 8 sample families when using simple regression (i.e., a regression model that includes only information about family size).
SSR = Σ(ŷ − ȳ)²
R² graphically:
[Venn diagram: overlap regions a, b, c between Y and the predictors X1 = Family Size and X2 = Family Income; NOTE: c is explained by both X1 and X2; d is the part of Y left unexplained]
SSR = a + b + c = 18.95
SST = a + b + c + d = 22
R² = SSR / SST = (a + b + c) / (a + b + c + d) = 18.95 / 22 = 86%
SSE = ?
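The identities SST = SSR + SSE and R² = SSR/SST tie these quantities together; a quick check with the two-predictor figures from the slide:

```python
ssr = 18.95            # a + b + c: variation explained by X1 and X2 together
sst = 22.0             # a + b + c + d: total variation in Y
r_squared = ssr / sst  # approx. 0.861, i.e., about 86%
sse = sst - ssr        # the unexplained region d
```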
Simple linear regression model: Y = β0 + β1X + ε
1. β0 and β1 are parameters
2. X is an independent variable
3. Deviations ε are independent N(0, σ²)
SSE = Σ(yᵢ − ŷᵢ)² = Σ(yᵢ − b0 − b1xᵢ)²
b̂1 = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²  or  b̂1 = [nΣXᵢYᵢ − (ΣXᵢ)(ΣYᵢ)] / [nΣXᵢ² − (ΣXᵢ)²]  or  b̂1 = r(Sy/Sx)
Given: n = 10, ΣX = 564, ΣY = 14365, ΣXY = 818755, ΣX² = 32604.
The least squares estimate of the regression coefficient is:
b̂1 = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = [10(818755) − (564)(14365)] / [10(32604) − (564)²] = 10.8
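Plugging the advertising-example sums from the slide into the computational formula confirms the slope:

```python
n = 10
sum_x, sum_y = 564, 14365        # sums of X and Y from the slide
sum_xy, sum_x2 = 818755, 32604   # sums of XY and X^2 from the slide
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # approx. 10.8
```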
SE(b̂0) = S(b̂0) = S·√[1/n + X̄² / Σ(Xᵢ − X̄)²]
Test statistic: t = b̂1 / SE(b̂1) = 10.8 / 2.38 = 4.5
Conclusion:
Since t = 4.5 > 2.306, we reject H0.
There is a linear association between advertising expenditure and weekly sales.
Confidence Intervals and Significance Tests
b̂1 ± t(α/2; n−2) · S(b̂1)
10.8 ± 2.306(2.38)
10.8 ± 5.49 → (5.31, 16.3)
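The confidence interval is just the estimate plus or minus the critical t value times the standard error; reproducing the slide's numbers:

```python
b1, se_b1 = 10.8, 2.38     # slope estimate and its standard error
t_crit = 2.306             # t(0.025; df = 8), from the slide
margin = t_crit * se_b1    # approx. 5.49
ci = (b1 - margin, b1 + margin)   # approx. (5.31, 16.3)
```

Because the interval excludes zero, it agrees with the t-test's rejection of H0.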
Source of variation   df   SS          MSS        F-ratio
Regression             1   92427.74    92427.74   20.47
Error                  8   36124.76    4515.6
Total                  9   128552.5
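The ANOVA table's internal consistency is easy to verify: each mean square is SS/df, and the F-ratio is MS(Regression)/MS(Error):

```python
ss_reg, ss_err = 92427.74, 36124.76   # sums of squares from the table
df_reg, df_err = 1, 8
ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err              # approx. 4515.6
f_ratio = ms_reg / ms_err             # approx. 20.47
ss_total = ss_reg + ss_err            # 128552.5
```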
Analysis of Variance Approach to Regression Analysis
F-Test for β1 = 0 versus β1 ≠ 0
Equivalence of F Test and t Test:
For a given α level, the F test of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the two-sided t-test.
Thus, at a given α level, we can use either the t-test or the F-test for testing β1 = 0 versus β1 ≠ 0.
The t-test is more flexible, since it can be used for a one-sided test as well.
Multivariate Regression Analysis
Population model:
Y = β0 + β1X1 + β2X2 + … + βkXk + εi
Sample model:
Y = b0 + b1X1 + b2X2 + … + bkXk + ei
Equivalent to testing
H0: population multiple correlation = 0 (or population R² = 0)
vs. Ha: population multiple correlation > 0
H1: β1 ≠ 0 or β2 ≠ 0 or both
Test statistic:
F = (R²/k) / [(1 − R²)/(n − (k + 1))] = (0.861/2) / [(1 − 0.861)/(8 − 3)] = 15.486
t = r√(n − 2) / √(1 − r²)
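The overall F statistic for the multiple regression follows directly from R², k, and n; reproducing the slide's value:

```python
r2 = 0.861   # R-squared of the two-predictor model
k = 2        # number of predictors
n = 8        # number of families
f = (r2 / k) / ((1 - r2) / (n - (k + 1)))   # approx. 15.486
```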
Sl No | Outcome of suicide | Age (yrs) | Sex | Marital status | Cause of suicide | Religion | SES | Occupation | Method of attempt | Time of event | Time interval between attempt and bringing to hospital
1 | Died | 19 | Female | Married | Dowry death | Hindu | Middle | Housewife | Hanging | Morning | 30 minutes
2 | Survived | 20 | Male | Unmarried | Failure in studies | Hindu | Middle | Student | Poisoning | — | 30 minutes
3 | Died | 21 | Male | Unmarried | Depression | Hindu | Lower | Painting/Coolie | Electrical | Night | Time of death unknown; shifted to hospital for postmortem
4 | Died | 21 | Male | Unmarried | Depression | Hindu | Middle | Manager in cane juice centre | Hanging | Morning | 25 minutes
5 | Died | 17 | Male | Unmarried | Failure in studies | Hindu | Middle | Student | Hanging | Night | Time of death unknown; shifted to hospital for postmortem
6 | Died | 20 | Male | Unmarried | Problem at work place | Hindu | Lower | Coolie | Electrical | Evening | Time of death unknown; shifted to hospital for postmortem
7 | Survived | 19 | Female | Unmarried | Pain abdomen | Hindu | Upper Middle | Student | Poisoning | — | 70 minutes
9 | Survived | 20 | Male | Unmarried | Depression | Hindu | Middle | Student | Fall from height | — | 35 minutes
10 | Died | 20 | Male | Unmarried | Love failure | Hindu | Middle | Student | Hanging | Morning | Time of death unknown; shifted to hospital for postmortem
Logistic Regression Analysis
Introduction
In linear regression models it is assumed that
the dependent variable ‘Y’ should be
Quantitative (Continuous/ Discrete) and
normally distributed.
But in many instances, the dependent variable will not be quantitative; instead, it may be categorical.
If the dependent variable is categorical, it violates the assumptions of normal linear regression.
A Few Examples:
Consumer chooses brand (1) or not (0);
A quality defect occurs (1) or not (0);
A person is hired (1) or not (0);
Evacuate home during hurricane (1) or not (0);
Other Examples???
Scatterplot with Y = (0, 1):
Y = Hired/Not Hired; X = Experience
[Figure: binary outcomes plotted at Y = 0 and Y = 1 against X]
The Linear Probability Model (LPM)
If we estimate the slope using OLS regression:
Hired = α + β·Income + e
Picture of LPM
Scatterplot with Y = (0, 1): Y = Hired/Not Hired; X = Experience
[Figure: LPM regression line (slope coefficient); points on the regression line represent the predicted probabilities of Y for each value of X]
An Example: Loan Approvals
Data:
Dependent Variable:
Loaned = 1 if Loan Approved
0 if not Approved by Bank Z
Independent Variables:
ROA = net income as % of total assets of applicant;
Debt = debt as % of total assets of applicant;
Officer = 1 if loan handled by loan officer A
0 if handled by officer B;
The Linear Probability Model (LPM): Weaknesses
• The predicted probabilities can be greater than 1 or less than 0.
Probabilities, by definition, have max = 1 and min = 0; this is not a big issue if the predictions are very close to 0 and 1.
• The error terms vary with the size of the X-variable ("heteroskedastic").
There may be models that have lower variance, i.e., that are more "efficient".
• The errors are not normally distributed, because Y takes on only two values.
This creates problems for statistical inference, though it is more of an issue for statistical theorists.
Picture of LPM fused with S-type curve
Scatterplot with Y = (0, 1): Y = Hired/Not Hired; X = Experience
[Figure: LPM regression line overlaid with an S-shaped (logistic) curve; points on the curve represent the predicted probabilities of Y for each value of X]
Model development
The model that describes the S-type curve is as follows.
Let 'p' be the probability that an event 'Y' occurs, i.e., P(Y = 1).
Model development
The estimated probability P(Y) with one predictor variable is given by
P(Y) = 1 / [1 + e^−(β0 + β1X1i + εi)]
and
1 − P(Y) = 1 − 1/[1 + e^−(β0 + β1X1i + εi)] = e^−(β0 + β1X1i + εi) / [1 + e^−(β0 + β1X1i + εi)]
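The S-type curve above is straightforward to evaluate numerically. A minimal sketch (the coefficient values passed in any call are illustrative, not estimates from the slides):

```python
import math

def logistic_p(b0, b1, x):
    """P(Y = 1) = 1 / (1 + e^-(b0 + b1*x)): the S-type curve.
    Always lies strictly between 0 and 1, unlike the LPM."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def one_minus_p(b0, b1, x):
    """1 - P(Y): reduces to e^-(b0 + b1*x) / (1 + e^-(b0 + b1*x))."""
    z = math.exp(-(b0 + b1 * x))
    return z / (1.0 + z)
```

At b0 + b1·x = 0 the curve passes through probability 0.5, and the two functions always sum to 1, matching the algebra on the slide.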
Model development
The ratio P(Y) / [1 − P(Y)] is called the odds: a ratio of two probabilities, given by
P(Y) / [1 − P(Y)] = e^(β0 + β1X1)
Relationship between Odds & Probability
Odds(Event) = Probability(Event) / [1 − Probability(Event)]
Probability(Event) = Odds(Event) / [1 + Odds(Event)]
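The two conversion formulas are inverses of each other, which a two-function sketch makes concrete:

```python
def odds_from_prob(p):
    """Odds(Event) = Probability / (1 - Probability)."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Probability(Event) = Odds / (1 + Odds)."""
    return odds / (1 + odds)
```

For example, a probability of 0.40 corresponds to odds of 0.40/0.60 ≈ 0.667, the figure used in the odds-ratio example that follows.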
The Odds Ratio
Definition of Odds Ratio: the ratio of two odds estimates.
So, if Pr(response | trt) = 0.40 and Pr(response | placebo) = 0.20, then:
Odds(response | trt group) = 0.40 / (1 − 0.40) = 0.667
Odds(response | placebo group) = 0.20 / (1 − 0.20) = 0.25
OR(Trt vs. Placebo) = 0.667 / 0.25 = 2.67
Model development
The logarithm of P(Y) / [1 − P(Y)] is called the logit.
Hence, Ln{P(Y) / [1 − P(Y)]} = β0 + β1X1 + ε
Confusion matrix
Outcome of suicide
Sex      Survived   Died   Total
Male     6          41     47
Female   14         52     66
Total    20         93     113
Sensitivity = a/(a + c) × 100        Specificity = d/(b + d) × 100
Positive predictive value = a/(a + b) × 100
Negative predictive value = d/(c + d) × 100
Accuracy = (a + d)/n × 100
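The five formulas above can be packaged as one helper. A sketch, with the usual 2×2 cell labels (a = true positives, b = false positives, c = false negatives, d = true negatives; the example counts in the test are made up, since the slide does not state which cells of its table map to a, b, c, d):

```python
def classification_metrics(a, b, c, d):
    """Compute the slide's five metrics (as percentages) from the
    2x2 table cells: a=TP, b=FP, c=FN, d=TN."""
    n = a + b + c + d
    return {
        "sensitivity": a / (a + c) * 100,
        "specificity": d / (b + d) * 100,
        "ppv": a / (a + b) * 100,   # positive predictive value
        "npv": d / (c + d) * 100,   # negative predictive value
        "accuracy": (a + d) / n * 100,
    }
```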