Objective : To determine whether maternal anthropometry
predicts birth weight of the child.
Study participants : 189 normotensive, pregnant women
registered in a gynecology department in a hospital in
Southern India in the year 2012.
Outcome variable : Birth weight in kilograms.
Other variables studied: Mother’s weight (in kg) at first
trimester of pregnancy, mother’s height (in cm), mother’s
age (in years), religion and mother’s smoking status.
Scatter plot – Birth weight vs mother's weight
Regression analysis
A statistical technique for investigating and modeling
the relationship between variables
An equation for the relation between dependent and
independent variables
Uses
Prediction & estimation
Adjust for confounders
Commonly used regression analyses
• Logistic regression
• Poisson regression
• Cox regression
Simple linear regression
E[Birth weight] = β0 + β1 × Mother's weight
Actual birth weight = E[Birth weight] + ϵ
E[y] = β0 + β1x
y = β0 + β1x + ϵ
• Error term and outcome variable are normally distributed
• Error term has constant variance at any given level of the independent variable
Eg. E[Birth weight] = −0.449 + 0.052 × Mother's weight
How do we interpret these coefficients?
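The fitted equation above can be checked numerically; a minimal sketch in Python (the coefficients are the slide's, the 55 kg input is purely illustrative):

```python
def predicted_birth_weight(mothers_weight_kg):
    """Slide's fitted model: E[Birth weight] = -0.449 + 0.052 * mother's weight."""
    return -0.449 + 0.052 * mothers_weight_kg

# beta1 = 0.052: each extra kg of maternal weight raises expected birth weight by 0.052 kg
print(round(predicted_birth_weight(55), 3))                               # 2.411
print(round(predicted_birth_weight(56) - predicted_birth_weight(55), 3))  # 0.052
```

The intercept −0.449 is the fitted value at mother's weight 0 kg, which lies far outside the data and has no clinical meaning on its own.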
Research questions
Predict birth weight from mother's weight and height
Find the relation between birth weight and mother's weight after adjusting for mother's height
Find the relation between birth weight and mother's height after adjusting for mother's weight
Multiple linear regression
More than one independent variable
Multiple linear regression
E[y] = β0 + β1 x1 + β2 x2 + …+ βp-1 xp-1
y= β0 + β1 x1 + β2 x2 + …+ βp-1 xp-1 + ϵ
• Error term and outcome variable are normally distributed
• Error term has constant variance at any given level of the independent variables
Eg. E[Birth weight]
= - 3.433 - 0.002 Age + 0.042 Weight + 0.022 Height
How do you interpret these coefficients?
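Each coefficient is the expected change in birth weight per unit change in that predictor, holding the others fixed; a numerical check with the slide's fitted coefficients (the input values 25 y, 55 kg, 155 cm are illustrative):

```python
def predicted_birth_weight(age_y, weight_kg, height_cm):
    # Slide's fitted multiple regression; each coefficient is the change in
    # expected birth weight (kg) per unit change in that predictor,
    # holding the other predictors fixed.
    return -3.433 - 0.002 * age_y + 0.042 * weight_kg + 0.022 * height_cm

base = predicted_birth_weight(25, 55, 155)
print(round(base, 3))  # 2.237
# Holding age and height fixed, one extra kg of mother's weight adds 0.042 kg:
print(round(predicted_birth_weight(25, 56, 155) - base, 3))  # 0.042
```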
Interpretation of regression coefficients
Qualitative independent variables
Qualitative or categorical predictor variable
- Mother's smoking status
What is the effect of mother's smoking habit on birth weight?
What is the effect of mother's weight on birth weight after adjusting for mother's smoking habit?
How to deal with qualitative (categorical) predictors in multiple regression?
Dummy variables (indicator variables)
Why dummy variables?
Smoking status = 0 if mother is a non-smoker
               = 1 if mother is an ex-smoker
               = 2 if mother is a current smoker
What would happen if we used this original coding instead of dummy variables?
The model would force the average difference in birth weight between current smokers and ex-smokers to be the same as that between ex-smokers and non-smokers
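The dummy coding the slide calls for can be sketched as follows (the string labels are assumptions; only the 0/1 pattern matters):

```python
def smoking_dummies(status):
    """Two indicator variables for a 3-level factor; non-smoker is the
    reference category, so both dummies are 0 for it."""
    return {"x_ex": int(status == "ex-smoker"),
            "x_current": int(status == "current smoker")}

print(smoking_dummies("non-smoker"))      # {'x_ex': 0, 'x_current': 0}
print(smoking_dummies("ex-smoker"))       # {'x_ex': 1, 'x_current': 0}
print(smoking_dummies("current smoker"))  # {'x_ex': 0, 'x_current': 1}
```

With two separate dummies, the ex-smoker and current-smoker effects get their own coefficients instead of being forced onto one common slope.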
Is the overall regression significant? ANOVA F test
Which specific regressors seem important? t test
Multiple linear regression for birth
weight data
Dependent variable – Birth weight
Independent variables – Mother's age, weight and height
P<0.001
Multiple linear regression for birth weight data (contd.)
Coefficient of multiple determination (R2)
R² = SSR / SST = 1 − SSE / SST,   0 ≤ R² ≤ 1
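The identity R² = 1 − SSE/SST can be verified numerically; the toy observed/predicted values below are illustrative:

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SSE/SST (equivalently SSR/SST)."""
    mean_y = sum(observed) / len(observed)
    sst = sum((y - mean_y) ** 2 for y in observed)                # total sum of squares
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # error sum of squares
    return 1 - sse / sst

print(r_squared([2.0, 2.5, 3.0], [2.0, 2.5, 3.0]))            # 1.0 (perfect fit)
print(round(r_squared([2.0, 2.5, 3.0], [2.1, 2.5, 2.9]), 2))  # 0.96
```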
Multicollinearity
• Near linear dependence among predictors
Effects of multicollinearity
Inflates the variance of the least squares estimators → increases S.E.(β̂)
Multicollinearity diagnosis
Variance Inflation Factor (VIF)
Measures how much the variances of the estimated
regression coefficients are inflated as compared to when
the predictor variables are not linearly related.
VIFj = 1 / (1 − Rj²)
where Rj² is the coefficient of determination from regressing the jth predictor on the remaining predictors.
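In the two-predictor case, Rj² reduces to the squared correlation between the two predictors, so the VIF can be sketched without a full regression (the toy data are illustrative):

```python
def vif_two_predictors(x1, x2):
    """VIF_j = 1 / (1 - R_j^2); with two predictors, R_j^2 is just the
    squared Pearson correlation between them."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r2 = cov * cov / (v1 * v2)
    return 1 / (1 - r2)

print(vif_two_predictors([1, 2, 1, 2], [1, 1, 2, 2]))            # 1.0 (orthogonal)
print(round(vif_two_predictors([1, 2, 3, 4], [1, 2, 3, 5]), 1))  # 29.2 (near-collinear)
```

A VIF of 1 means no inflation; values much larger than 1 signal that the predictor is nearly a linear combination of the others.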
Objective : To determine risk factors for low birth weight
Study participants : 189 normotensive, pregnant women
registered in a gynecology department in a hospital in
Southern India in the year 2012.
Outcome variable : Birth weight <2.5 kg as low and birth
weight ≥ 2.5 kg as normal.
Logistic regression
Binary
Multinomial
Ordinal
Binary Logistic Regression
Outcome variable is dichotomous or binary
Eg. Dead/alive, case/controls, side effects (Y/N),
good/poor outcome etc.
Independent variables can be categorical as well as
continuous
Eg. Gender, age, treatment group, severity of disease,
comorbid conditions etc.
A hospital based case control study was conducted to
identify risk factors for myocardial infarction (MI).
Binary Logistic Regression
Dependent variable
Yi = 1 if the ith subject experiences the event
   = 0 otherwise
Independent variable
Xi = 1 if the ith subject is in the exposed group
   = 0 otherwise
Binary Logistic Regression
The dependent variables Yi are independent Bernoulli random variables with probabilities
P(Yi = 1) = Pi and
P(Yi = 0) = 1 − Pi
Mean of 𝑌𝑖 , 𝐸[𝑌𝑖 ] = 𝑃𝑖
Variance of 𝑌𝑖 , 𝑉𝑎𝑟[𝑌𝑖 ] = 𝑃𝑖 (1 − 𝑃𝑖 )
Simple Binary Logistic Regression
Simple because of only one independent variable
The form of the simple logistic regression model is
P(Yi = 1) = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)] = Pi = E[Yi]
Simple Binary Logistic Regression
P(Yi = 1) = Pi = exp(β0 + β1Xi) / [1 + exp(β0 + β1Xi)]
1 − P(Yi = 1) = 1 − Pi = 1 / [1 + exp(β0 + β1Xi)]
P(Yi = 1) / [1 − P(Yi = 1)] = exp(β0 + β1Xi)
Simple Binary Logistic Regression
Odds = exp(β0 + β1Xi)
loge(Odds) = β0 + β1Xi — the logit is linear in X
• Logit = loge[Pi / (1 − Pi)]
• loge(Odds) = β0 + β1Xi
• Pi = exp(logit) / [1 + exp(logit)]
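The logit and its inverse can be sketched directly from these relations:

```python
import math

def logit(p):
    """Log-odds of a probability: log_e[p / (1 - p)]."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back-transform a logit to a probability: exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1 + math.exp(x))

print(logit(0.5))                        # 0.0 (odds = 1)
print(round(inv_logit(logit(0.2)), 10))  # 0.2 (round trip recovers the probability)
```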
Link functions
A function of the dependent variable that is a linear function of the independent variables
Interpretation of regression
coefficients
Let Xi = 1 if the ith subject is in the exposed group
       = 0 otherwise
Logit(X=1) = β0 + β1 × 1
Logit(X=0) = β0 + β1 × 0
Logit(X=1) − Logit(X=0) = β1
loge(Odds for Y=1 | X=1) − loge(Odds for Y=1 | X=0) = β1
loge[(Odds for Y=1 | X=1) / (Odds for Y=1 | X=0)] = β1
Interpretation of regression
coefficients
X – binary independent variable
Odds ratio = 𝑒 𝛽1
95% confidence intervals for β0 and β1
β̂1 ± 1.96 S.E.(β̂1)
β̂0 ± 1.96 S.E.(β̂0)
(e^LL, e^UL) gives the CI for the OR, where LL and UL are the lower and upper limits of the CI for β1
If the CI includes one, the OR is not significant
Examples
1) 95% CI for OR (1.8, 4.6): OR and β1 significant
2) 95% CI for OR (0.8, 4.6): OR and β1 not significant
3) 95% CI for OR (0.4, 0.6): OR and β1 significant
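The CI-for-OR recipe in code; the coefficient and standard error below are hypothetical, chosen only to illustrate the computation:

```python
import math

def odds_ratio_ci(beta_hat, se, z=1.96):
    """OR = e^beta; its 95% CI comes from exponentiating the CI for beta."""
    return (math.exp(beta_hat),
            math.exp(beta_hat - z * se),   # e^LL
            math.exp(beta_hat + z * se))   # e^UL

or_, low, high = odds_ratio_ci(0.693, 0.2)  # hypothetical beta_hat and SE
print(round(or_, 2), round(low, 2), round(high, 2))  # 2.0 1.35 2.96
print(low <= 1 <= high)  # False -> CI excludes 1, so the OR is significant
```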
Interpretation of 𝛽1
Interpretation of intercept 𝛽0
When X = 0, logit = 𝛽0
Test of significance for the coefficients
using Wald test
Null hypothesis H0: β1 = 0, equivalent to H0: OR = 1
Alternate hypothesis H1: β1 ≠ 0
Wald test
Test statistic = β̂1 / S.E.(β̂1)
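A sketch of the Wald test, using the normal approximation for the two-sided p-value; the estimate and standard error are hypothetical:

```python
import math

def wald_test(beta_hat, se):
    """Wald z statistic and two-sided p-value via the standard normal CDF,
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    z = beta_hat / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = wald_test(0.98, 0.5)      # hypothetical beta_hat and SE
print(round(z, 2), round(p, 3))  # 1.96 0.05 (borderline significance)
```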
Polytomous independent variable
Xdaily = 1 if the subject is a daily smoker
       = 0 otherwise
Xoccas = 1 if the subject is an occasional smoker
       = 0 otherwise
Logit function is
Logit = 𝛽0 + 𝛽1 𝑋𝑑𝑎𝑖𝑙𝑦 + 𝛽2 𝑋𝑜𝑐𝑐𝑎𝑠
Polytomous independent variable
Logit = 𝛽0 + 𝛽1 𝑋𝑑𝑎𝑖𝑙𝑦 + 𝛽2 𝑋𝑜𝑐𝑐𝑎𝑠
Polytomous independent variable
β1 is the difference in logits between daily smokers and never smokers
β2 is the difference in logits between occasional smokers and never smokers
e^β1 is the odds ratio: odds of getting MI among daily smokers compared to that among never smokers
e^β2 is the odds ratio: odds of getting MI among occasional smokers compared to that among never smokers
e^(β1−β2) is the odds ratio: odds of getting MI among daily smokers compared to that among occasional smokers
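The relation e^(β1−β2) = e^β1 / e^β2 can be verified with hypothetical coefficients (the values 1.2 and 0.5 are illustrative, not from the MI study):

```python
import math

b_daily, b_occas = 1.2, 0.5  # hypothetical logit coefficients vs never smokers

or_daily = math.exp(b_daily)                      # daily vs never smokers
or_occas = math.exp(b_occas)                      # occasional vs never smokers
or_daily_vs_occas = math.exp(b_daily - b_occas)   # daily vs occasional smokers

# The third OR is exactly the ratio of the first two:
print(round(or_daily_vs_occas, 4) == round(or_daily / or_occas, 4))  # True
```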
Defining reference category
• For dichotomous independent variables, dummy variables are not needed
• For polytomous independent variables, create dummy variables
• One fewer than the number of levels of the independent variable
• The omitted level is the reference category
• Generally choose the level least at risk of being a case
• The reference category should have adequate frequency
Interpretation of regression coefficient for
continuous independent variable
X – continuous independent variable
A one-unit change may not be clinically relevant (eg. age, SBP)
Interpretation of regression coefficient for
continuous independent variable
Outcome variable is MI and the independent variable is SBP
A constant effect per unit may not hold, as the risk for 120 vs 125 mmHg differs from that for 160 vs 165 mmHg
Multiple Logistic Regression
Multiple independent variables
Gives the adjusted odds ratios
The model is
P(Yi = 1 | X) = exp(β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip) / [1 + exp(β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip)] = Pi = E[Yi]
Logit = loge[Pi / (1 − Pi)] = β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip
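A sketch of predicted probabilities from the multiple logistic model, with hypothetical coefficients; the adjusted OR for a binary predictor can be recovered from two predictions that differ only in that predictor:

```python
import math

def predicted_probability(beta0, betas, xs):
    """P(Y=1 | X) under the multiple logistic model."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))  # linear predictor (logit)
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficients: intercept, smoking (0/1), age in years
p_smoker = predicted_probability(-4.0, [0.9, 0.05], [1, 50])
p_nonsmoker = predicted_probability(-4.0, [0.9, 0.05], [0, 50])

# Adjusted OR for smoking = e^0.9, recoverable from the two probabilities:
odds_ratio = (p_smoker / (1 - p_smoker)) / (p_nonsmoker / (1 - p_nonsmoker))
print(round(odds_ratio, 3) == round(math.exp(0.9), 3))  # True
```

This is why logistic coefficients are read as adjusted log odds ratios: the other predictors cancel when two subjects differ only in the predictor of interest.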
Interpretation of coefficients
For example, the effect of smoking after adjusting for other factors such as gender, age and obesity
Testing the significance of overall model
• Likelihood ratio test
• Null hypothesis H0: all β's equal zero
• Alternate hypothesis H1: at least one β not equal to zero
The test statistic is
G = −2 loge[(likelihood without any variables) / (likelihood with all p variables)]
Testing the significance of individual coefficients
Test statistic = β̂j / S.E.(β̂j)
Best model
The one that best explains the data with the minimum number of variables
How to achieve this goal?
We must have
Model building steps
1) Univariate analysis
2) Selecting the candidate variables for multivariable analysis
3) Verification of importance of each variable in the model
4) Check assumptions of linearity (for continuous
independent variables)
5) Check for possible interaction between variables
6) Check for the adequacy (goodness of fit) of the final
model
Summary measures of goodness of fit
Give an overall indication of fit of the model
A measure of difference between the observed and predicted
values
A large value indicates a problem with the model
Deviance residual
Hosmer-Lemeshow test of goodness of fit
Classification table
Area under the ROC curve
Hosmer-Lemeshow test
1) Obtain the estimated probabilities for all the n subjects
2) Group the data into ten groups (deciles) based on these probabilities
3) Obtain the Hosmer-Lemeshow goodness-of-fit statistic
Ĉ = Σk=1..g (Ok − nk π̄k)² / [nk π̄k (1 − π̄k)]
where nk is the total number of subjects in the kth decile group,
Ok = Σj=1..Ck yj is the number of events in the kth decile group,
Ck is the number of covariate patterns in the kth decile group, and
π̄k = (Σj=1..Ck mj π̂j) / nk
Under the null hypothesis of correct fit, Ĉ follows a chi-square distribution with g − 2 degrees of freedom.
Thank you