
Linear & Logistic Regressions

Dr. BINU V.S, Ph.D


Assistant Professor
Department of Biostatistics
Contents
• Introduction to Linear Regression
• Simple linear regression
• Multiple linear regression
• Introduction to Logistic Regression
• Simple logistic regression
• Multiple logistic regression
• Practicals using SPSS
Objective: To determine whether maternal anthropometry
predicts birth weight of the child.
Study participants: 189 normotensive pregnant women
registered in a gynecology department in a hospital in
Southern India in the year 2012.
Outcome variable: Birth weight in kilograms.
Other variables studied: Mother's weight (in kg) at the first
trimester of pregnancy, mother's height (in cm), mother's
age (in years), religion, and mother's smoking status.
[Figure: Scatter plot of birth weight vs. mother's weight]
Regression analysis
• A statistical technique for investigating and modeling
  the relationship between variables
• An equation for the relation between dependent and
  independent variables

Uses
• Prediction and estimation
• Adjustment for confounders
Commonly used regression analyses
• Simple linear regression
• Multiple linear regression
• Logistic regression
• Poisson regression
• Cox regression
Simple linear regression
E[Birth weight] = β0 + β1 Mother's weight
Actual birth weight = E[Birth weight] + ϵ
E[y] = β0 + β1 x
y = β0 + β1 x + ϵ
• The error term and the outcome variable are normally
  distributed
• The error term has constant variance at any given level
  of the independent variable
E.g., E[Birth weight] = −0.449 + 0.052 Mother's weight

How do we interpret these coefficients?
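A minimal sketch of fitting this model in Python with statsmodels (the deck's practicals use SPSS; this is only for illustration). The data are simulated around the slide's fitted line, since the study dataset is not reproduced here, and the names mom_wt and birth_wt are my own.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
mom_wt = rng.normal(55, 8, 189)            # mother's weight in kg (simulated)
birth_wt = -0.449 + 0.052 * mom_wt + rng.normal(0, 0.4, 189)  # slide's line plus noise

X = sm.add_constant(mom_wt)                # adds the intercept column for beta0
fit = sm.OLS(birth_wt, X).fit()
print(fit.params)                          # estimated beta0 and beta1
print(fit.summary())                       # standard errors, t tests, R-squared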
Research questions
• Predict birth weight based on mother's weight and height
• Find the relation between birth weight and mother's
  weight after adjusting for mother's height
• Find the relation between birth weight and mother's
  height after adjusting for mother's weight
→ Multiple linear regression
• More than one independent variable
Multiple linear regression
E[y] = β0 + β1 x1 + β2 x2 + … + βp−1 xp−1
y = β0 + β1 x1 + β2 x2 + … + βp−1 xp−1 + ϵ
• The error term and the outcome variable are normally
  distributed
• The error term has constant variance at any given level of
  the independent variables
E.g., E[Birth weight]
  = −3.433 − 0.002 Age + 0.042 Weight + 0.022 Height
• How do you interpret these coefficients?
Interpretation of regression coefficients
• βk indicates the change in the mean response E[Y]
  with a unit change in the predictor variable Xk when
  all other predictor variables in the regression model
  are held constant
• b0 (y-intercept): If the range of the data on the x's includes
  zero, then b0 is the mean response when all x's = 0.
  Otherwise b0 has no practical interpretation.
Qualitative independent variables
A qualitative or categorical predictor variable,
e.g., mother's smoking status:
• What is the effect of mother's smoking habit on birth
  weight?
• What is the effect of mother's weight on birth weight after
  adjusting for mother's smoking habit?
How do we deal with qualitative (categorical) predictors in
multiple regression?
→ Dummy variables (indicator variables)
Why dummy variables?
Smoking status = 0 if mother is a non-smoker
               = 1 if mother is an ex-smoker
               = 2 if mother is a current smoker
What will happen if we use the original codes instead
of dummy variables?
• The model would force the average difference in birth weight
  between current smokers and ex-smokers to be the same as that
  between ex-smokers and non-smokers
→ Not practically or clinically true
Dummy variables (indicator variables)
• Binary variables (only two categories, coded as 0, 1 or 1, 2 etc.)
• For a qualitative variable with c classes, c − 1 indicator variables
  are needed, each taking on the values 0 and 1
• E.g., smoking status: non-smoker, ex-smoker and current smoker
  X1 = 1 if current smoker, 0 otherwise
  X2 = 1 if ex-smoker, 0 otherwise
  Reference category = non-smoker

  Smoking status    X1   X2
  Current smoker     1    0
  Ex-smoker          0    1
  Non-smoker         0    0
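As a small illustration, the same c − 1 coding can be produced with pandas; the column labels here are my own, not from the study data.

import pandas as pd

df = pd.DataFrame({"smoking": ["current", "ex", "never", "current", "never"]})
dummies = pd.get_dummies(df["smoking"], dtype=int)   # one 0/1 column per category
X = dummies.drop(columns="never")                    # drop the reference category
print(X)   # two indicator columns remain: 'current' and 'ex'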
Hypothesis testing in multiple linear regression
• Is the model statistically significant?
  → ANOVA F test
• Which specific regressors seem important?
  → t test
Multiple linear regression for birth weight data
Dependent variable – Birth weight
Independent variables – Mother's age, weight and height

Overall model (ANOVA F test): P < 0.001
Multiple linear regression for birth weight data (contd.)

Model        B        Std. Error   Beta     t        Sig.    95% CI for B
                                                             Lower    Upper
(Constant)   -3.433   .518                  -6.630   .000    -4.454   -2.411
Mom_wt        .042    .003          .641    13.491   .000     .036     .048
Mom_Age      -.002    .005         -.017    -.426    .671    -.012     .008
mom_ht        .022    .003          .299     6.316   .000     .015     .028

(B = unstandardized coefficient, Beta = standardized coefficient)

Birth weight = −3.433 + 0.042 Weight − 0.002 Age + 0.022 Height
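A sketch of how this output could be reproduced with the statsmodels formula interface; the file name births.csv and the column names are assumptions about how the 189 records might be stored.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("births.csv")   # hypothetical file holding the 189 records
fit = smf.ols("birth_wt ~ Mom_wt + Mom_Age + mom_ht", data=df).fit()
print(fit.summary())             # B, SE, t, Sig. and R-squared, as in the table
print(fit.conf_int(alpha=0.05))  # 95% confidence intervals for B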
Coefficient of multiple determination (R²)
• Measures the strength of the linear association
  between the response and the predictors.

  R² = SSR / SST = 1 − SSE / SST,   0 ≤ R² ≤ 1

• The proportion of variation in the dependent
  variable explained by the predictors
Multicollinearity
• Near-linear dependence among predictors

Effects of multicollinearity
• Inflates the variance of the least squares
  estimators → increases SE(β̂)
Multicollinearity diagnosis
Variance Inflation Factor (VIF)
• Measures how much the variances of the estimated
  regression coefficients are inflated compared to when
  the predictor variables are not linearly related.

  VIF_j = 1 / (1 − R_j²)

• R_j² is the coefficient of multiple determination when x_j is
  regressed on the remaining predictor variables
• VIF > 5 indicates multicollinearity
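A short sketch of computing VIFs with statsmodels, reusing the hypothetical births.csv file and column names from the earlier sketch.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("births.csv")   # hypothetical data file, as above
X = sm.add_constant(df[["Mom_wt", "Mom_Age", "mom_ht"]])
for j, name in enumerate(X.columns):
    if name != "const":          # skip the intercept column
        print(name, variance_inflation_factor(X.values, j))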
Logistic Regression

Objective: To determine risk factors for low birth weight.
Study participants: 189 normotensive pregnant women
registered in a gynecology department in a hospital in
Southern India in the year 2012.
Outcome variable: Birth weight < 2.5 kg classified as low and
birth weight ≥ 2.5 kg as normal.
→ A categorical (binary) outcome variable.
Other variables studied: Mother's weight (in kg) at the first
trimester of pregnancy, mother's height (in cm), mother's
age (in years), religion, and mother's smoking status.
Binary outcome variable
• The error term and the outcome variable are not normally
  distributed
• The error term does not have constant variance

→ Logistic regression
• Binary
• Multinomial
• Ordinal
Binary Logistic Regression
• The outcome variable is dichotomous or binary
  E.g., dead/alive, case/control, side effects (Y/N),
  good/poor outcome, etc.
• Independent variables can be categorical as well as
  continuous
  E.g., gender, age, treatment group, severity of disease,
  comorbid conditions, etc.
A hospital-based case-control study was conducted to
identify risk factors for myocardial infarction (MI).
Binary Logistic Regression
Dependent variable
  Yi = 1 if the ith subject has the event
     = 0 otherwise
Independent variable
  Xi = 1 if the ith subject is in the exposed group
     = 0 otherwise
Binary Logistic Regression
The dependent variables Yi are independent Bernoulli
random variables with probabilities

  P(Yi = 1) = Pi and
  P(Yi = 0) = 1 − Pi

Mean of Yi:     E[Yi] = Pi
Variance of Yi: Var[Yi] = Pi(1 − Pi)
Simple Binary Logistic Regression
• Simple because there is only one independent variable
• The form of the simple logistic regression model is

  P(Yi = 1) = exp(β0 + β1 Xi) / (1 + exp(β0 + β1 Xi)) = Pi = E[Yi]

• The predicted value in logistic regression is the
  probability that an individual has the event under study
Simple Binary Logistic Regression

  P(Yi = 1) = Pi = exp(β0 + β1 Xi) / (1 + exp(β0 + β1 Xi))

  1 − P(Yi = 1) = 1 − Pi = 1 / (1 + exp(β0 + β1 Xi))

  P(Yi = 1) / (1 − P(Yi = 1)) = exp(β0 + β1 Xi)

→ The ODDS of occurrence of the event
Simple Binary Logistic Regression

  ODDS = exp(β0 + β1 Xi)

  loge(ODDS) = β0 + β1 Xi

→ The logit is linear
• The logit is linear in the parameters, continuous, and may
  range from −∞ to +∞ depending on the range of X
Logit

• Logit = loge[Pi / (1 − Pi)]

• loge(ODDS) = β0 + β1 Xi

• Pi = exp(logit) / (1 + exp(logit))
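As a small illustration, the logit and its inverse can be written as two Python helpers (pure arithmetic, nothing model-specific):

import math

def logit(p):
    # log-odds: loge[p / (1 - p)]
    return math.log(p / (1.0 - p))

def inv_logit(x):
    # probability: exp(x) / (1 + exp(x))
    return math.exp(x) / (1.0 + math.exp(x))

assert abs(inv_logit(logit(0.3)) - 0.3) < 1e-12   # the two are inverses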
Link functions
• A function of the dependent variable that gives a linear
  function of the independent variables

• The logistic regression link function is the logit function

  link function = loge[P(y = 1) / (1 − P(y = 1))]

• The linear regression link function is the identity function (y = y)
Interpretation of regression coefficients
Let Xi = 1 if the ith subject is in the exposed group
       = 0 otherwise

  Logit(X = 1) = β0 + β1 × 1
  Logit(X = 0) = β0 + β1 × 0
  Logit(X = 1) − Logit(X = 0) = β1
  loge(odds for y = 1 | X = 1) − loge(odds for y = 1 | X = 0) = β1
  loge[(odds for y = 1 | X = 1) / (odds for y = 1 | X = 0)] = β1
Interpretation of regression coefficients
X – binary independent variable

  loge[(odds for y = 1 | X = 1) / (odds for y = 1 | X = 0)] = β1

  loge(odds ratio) = β1

  Odds ratio = e^β1
95% confidence intervals for β0 and β1
  β1 ± 1.96 S.E.(β1)
  β0 ± 1.96 S.E.(β0)
• (e^LL, e^UL) gives the 95% CI for the OR, where LL and UL are the
  lower and upper limits of the CI for β1
• If the CI includes one, then the OR is not significant
Examples
1) 95% CI for OR (1.8, 4.6) → OR and β1 significant
2) 95% CI for OR (0.8, 4.6) → OR and β1 not significant
3) 95% CI for OR (0.4, 0.6) → OR and β1 significant
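A sketch of the calculation above, with made-up values for β1 and its standard error: form the 95% CI on the log-odds scale, then exponentiate the limits.

import math

beta1, se = 0.8, 0.25                            # illustrative estimates
ll, ul = beta1 - 1.96 * se, beta1 + 1.96 * se    # 95% CI for beta1
print("OR:", math.exp(beta1))                    # e^beta1
print("95% CI for OR:", (math.exp(ll), math.exp(ul)))   # (e^LL, e^UL)
# If this interval includes 1, the OR is not statistically significant.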
Interpretation of β1

  β1 < 0 → Odds ratio < 1

  β1 > 0 → Odds ratio > 1

  β1 = 0 → Odds ratio = 1
Interpretation of intercept β0
• When X = 0, logit = β0
• e^β0 gives the odds of developing the event when X = 0,
  i.e., the odds of developing the event in the reference group
Test of significance for the coefficients
using the Wald test
Null hypothesis H0: β1 = 0, equivalent to H0: OR = 1
Alternate hypothesis H1: β1 ≠ 0
Wald test

  Test statistic = β̂1 / S.E.(β̂1)

• Under H0 the test statistic follows a standard normal
  distribution.
• Similarly for β0
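A sketch of the Wald test with the same illustrative estimates as above: z = β1 / S.E.(β1), with a two-sided p-value from the standard normal distribution.

from scipy.stats import norm

beta1, se = 0.8, 0.25            # illustrative estimates
z = beta1 / se                   # Wald test statistic
p = 2 * norm.sf(abs(z))          # two-sided p-value from the standard normal
print(f"z = {z:.2f}, p = {p:.4f}")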
Interpretation of regression coefficients for a
polytomous independent variable
X – polytomous independent variable
E.g., smoking – 3 categories
  X = 2 for daily smokers
    = 1 for occasional smokers
    = 0 for never smokers → reference category
• Two dummy variables are needed
Polytomous independent variable
  Xdaily = 1 if a daily smoker
         = 0 otherwise
  Xoccas = 1 if an occasional smoker
         = 0 otherwise

The logit function is
  Logit = β0 + β1 Xdaily + β2 Xoccas
Polytomous independent variable
  Logit = β0 + β1 Xdaily + β2 Xoccas

Logit for individuals in the daily smoker group:
  Logit = β0 + β1 × 1
Logit for individuals in the occasional smoker group:
  Logit = β0 + β2 × 1
Logit for individuals in the never smoker group:
  Logit = β0
→ Reference category = never smoker
Polytomous independent variable
• β1 is the difference in logits between daily smokers and
  never smokers
• β2 is the difference in logits between occasional smokers
  and never smokers
• e^β1 is the odds ratio comparing the odds of MI among daily
  smokers to that among never smokers
• e^β2 is the odds ratio comparing the odds of MI among
  occasional smokers to that among never smokers
• e^(β1−β2) is the odds ratio comparing the odds of MI among
  daily smokers to that among occasional smokers
Defining the reference category
• For dichotomous independent variables, no dummy
  variables are needed
• For polytomous independent variables, create dummy
  variables
• One less than the number of levels in the independent
  variable
• The omitted level is the reference category
• Generally choose the level that is least at risk of being a case
• The reference category should have adequate frequency
Interpretation of the regression coefficient for a
continuous independent variable
X – continuous independent variable
E.g., age, cholesterol level, weight, etc.
• The assumption is that the logit is linear in the continuous
  variable
• β1 is the change in log odds (logit) for a unit change in X
• A one-unit change may not be clinically relevant (e.g., age, SBP)
• A change of c units in X → cβ1, and the odds ratio is e^(cβ1)
Interpretation of the regression coefficient for a
continuous independent variable
The outcome variable is MI and the independent variable is SBP
• β1 = 0.225, and the OR for an increase of 5 mmHg is

  OR = e^(5 × 0.225) = 3.1

• For every increase of 5 mmHg in SBP, the odds of MI increase
  about 3-fold
• This may not hold in practice, as the risk for 120 mmHg vs
  125 mmHg differs from that of 160 mmHg vs 165 mmHg.
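The arithmetic of the e^(cβ1) rule, checked in code for a few values of c:

import math

beta1 = 0.225
for c in (1, 5, 10):
    print(f"OR for a {c} mmHg increase: {math.exp(c * beta1):.2f}")
# 1 mmHg -> 1.25, 5 mmHg -> 3.08 (the slide's 3.1), 10 mmHg -> 9.49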
Multiple Logistic Regression
• Multiple independent variables
• Gives the adjusted odds ratios
The model is

  P(Yi = 1 | X) = exp(β0 + β1 Xi1 + β2 Xi2 + … + βp Xip)
                  / (1 + exp(β0 + β1 Xi1 + β2 Xi2 + … + βp Xip))
                = Pi = E[Yi]

  Logit = loge[Pi / (1 − Pi)] = β0 + β1 Xi1 + β2 Xi2 + … + βp Xip
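A sketch of fitting this model with statsmodels. The outcome low_bw (1 = birth weight < 2.5 kg), the predictor names, and the file births.csv are assumptions about the dataset, not its actual layout.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("births.csv")   # hypothetical file; low_bw = 1 if birth weight < 2.5 kg
fit = smf.logit("low_bw ~ Mom_wt + Mom_Age + mom_ht + smoker", data=df).fit()
print(fit.summary())             # betas, SEs, Wald z and p-values
print(np.exp(fit.params))        # adjusted odds ratios
print(np.exp(fit.conf_int()))    # 95% CIs for the adjusted ORs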
Interpretation of coefficients
For example, the effect of smoking after adjusting for other factors
such as gender, age and obesity:

  Logit = β0 + β1 Smoke + β2 Age + β3 Sex + β4 Obese

• β1 is the expected difference in log odds (logit) between smokers
  and never smokers when all the other variables in the model are
  held constant
• e^β1 is the odds ratio for MI comparing smokers to never smokers
  when all other variables in the model are held constant
→ Adjusted odds ratios
Testing the significance of the overall model
• Likelihood ratio test
• Null hypothesis H0: all β's equal to zero
• Alternate hypothesis H1: at least one β not equal to zero
• The test statistic is

  G = −2 log [ (likelihood without any variables)
               / (likelihood with all p variables) ]

• Under H0, the test statistic follows a chi-square distribution
  with p degrees of freedom.
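A sketch of the likelihood ratio test, reusing df and fit from the previous sketch: refit an intercept-only model and compare log-likelihoods.

from scipy.stats import chi2
import statsmodels.formula.api as smf

null_fit = smf.logit("low_bw ~ 1", data=df).fit()   # intercept-only model
G = -2 * (null_fit.llf - fit.llf)                   # -2 log(L_null / L_full)
p = chi2.sf(G, len(fit.params) - 1)                 # chi-square with p df
print(f"G = {G:.2f}, p = {p:.4f}")
# statsmodels also reports this directly as fit.llr and fit.llr_pvalue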
Testing the significance of individual coefficients
• Null hypothesis H0: βj = 0
• Alternate hypothesis H1: βj ≠ 0
Wald test

  Test statistic = β̂j / S.E.(β̂j)

• Under H0, the test statistic follows a standard
  normal distribution
Best model
• The one which best explains the data with the minimum number
  of variables
• A model with a minimum number of variables is more likely to
  be numerically stable and more easily generalized
• The more variables in the model, the greater the estimated
  standard errors of the coefficients, and the more dependent
  the model becomes on the observed data
How to achieve this goal?
We must have
• A strategy to select appropriate variables
• A set of methods for assessing the adequacy of the
  model in terms of its individual variables and overall fit
Model building steps
1) Univariate analysis
2) Selecting the candidate variables for multivariable analysis
3) Verification of importance of each variable in the model
4) Check assumptions of linearity (for continuous
independent variables)
5) Check for possible interaction between variables
6) Check for the adequacy (goodness of fit) of the final
model

Summary measures of goodness of fit
• Give an overall indication of the fit of the model
• A measure of the difference between the observed and predicted
  values
• A large value indicates a problem with the model

Commonly used measures are
• Pearson chi-square residual statistic
• Deviance residual
• Hosmer-Lemeshow test of goodness of fit
• Classification table
• Area under the ROC curve
Hosmer-Lemeshow test
1) Obtain the estimated probabilities for all the n subjects
2) Group the data into ten parts (deciles of risk) based on these
   probabilities
3) Obtain the Hosmer-Lemeshow goodness-of-fit statistic

  Ĉ = Σ (k = 1 to g) [ (Ok − nk π̄k)² / (nk π̄k (1 − π̄k)) ]

where nk is the total number of subjects in the kth decile group,
Ok = Σ (j = 1 to Ck) yj is the number of events in the kth decile group,
Ck is the number of covariate patterns in the kth decile group, and
π̄k = Σ (j = 1 to Ck) mj π̂j / nk is the average estimated probability.

Under the null hypothesis of correct fit, Ĉ follows a chi-square
distribution with g − 2 degrees of freedom.
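A compact sketch of the statistic, simplified by treating each subject as its own covariate pattern (so Ck = nk and mj = 1); y holds the 0/1 outcomes and p the fitted probabilities.

import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, g=10):
    # Sort subjects by fitted probability and split into g groups
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    C = 0.0
    for idx in np.array_split(np.arange(len(p)), g):
        n_k = len(idx)
        o_k = y[idx].sum()        # observed events in the group
        pi_k = p[idx].mean()      # average estimated probability
        C += (o_k - n_k * pi_k) ** 2 / (n_k * pi_k * (1 - pi_k))
    return C, chi2.sf(C, g - 2)   # statistic and p-value (g - 2 df)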

Thank you

