Calculus Chapter 5
R-code:
Challenger<-data.frame(temp=c(53,57,58,63,66,67,68,69,70,72,73,75,76,78,79,81),
ntd=c(1,1,1,1,0,0,0,0,2,0,0,1,0,0,0,0), n=c(1,1,1,1,1,3,1,1,4,1,1,2,2,1,1,1))
Challenger
Challenger.lg<-glm(ntd/n~temp, weights=n, family=binomial(), data=Challenger)
Challenger.lg
summary(Challenger.lg)
R output
b) Estimate the probability of thermal distress at 31F, the temperature at the time
of the Challenger flight.
Ans: According to the model, 𝜋(x) = exp(𝛼 + 𝛽x) / (1 + exp(𝛼 + 𝛽x)), so
𝜋̂(31) = exp(15.043 − 0.232×31) / (1 + exp(15.043 − 0.232×31)) = 0.9996 = 99.96%
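This estimate can be reproduced in R. A minimal sketch, reading the coefficient values 15.043 and −0.232 off the summary output above and assuming the fitted object Challenger.lg from the code above:

```r
# Probability of thermal distress at 31 F, from the reported coefficients
eta <- 15.043 - 0.232 * 31
pi.hat <- exp(eta) / (1 + exp(eta))
pi.hat  # about 0.9996

# The same estimate straight from the fitted model:
predict(Challenger.lg, newdata = data.frame(temp = 31), type = "response")
```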
f) Conduct a Wald confidence interval for the odds ratio corresponding to a 1-unit
increase in temperature. Interpret.
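A hedged sketch of the Wald interval: the coefficient −0.232 comes from the summary output above, while the standard error 0.108 used here is an assumed placeholder for the value reported in that output.

```r
# 95% Wald CI for the odds ratio of a 1-degree increase in temperature;
# se is a placeholder for the standard error shown in summary(Challenger.lg)
beta <- -0.232
se <- 0.108
ci <- exp(beta + c(-1, 1) * 1.96 * se)
ci  # roughly (0.64, 0.98)
# Interpretation: each 1-degree rise multiplies the estimated odds of
# thermal distress by exp(-0.232) = 0.79, and the interval excludes 1.
```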
Logistic regression with categorical predictors
In regression analysis, the way to include CATEGORICAL variables as explanatory
variables in the model is using dummy variables, i.e. indicator variables. An indicator
variable usually takes two values:
1 -- the observation is in the particular category
0 -- the observation is not in that category
Consider the following example: a nominal variable “color” with four categories.
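As a small illustration (the color labels here are hypothetical, not from the crab data), R builds these indicator variables automatically via model.matrix:

```r
# A 4-level nominal variable; R drops the first level as the baseline and
# creates one 0/1 indicator column for each of the remaining three levels
color <- factor(c("blue", "green", "red", "yellow", "blue"))
model.matrix(~ color)
```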
This difference between two logits equals the difference of log odds.
Therefore, exp(𝛽1) is the conditional odds ratio between X and Y.
That is, controlling for Z, the odds of “success” at x=1 equals exp(𝛽1 )
times the odds of success at x=0.
In the study, 338 veterans whose immune systems were beginning to falter
after infection with the AIDS virus were randomly assigned either to
receive AZT immediately or to wait until their T cells showed severe immune
weakness. Table below is a 2 × 2 × 2 cross classification of veteran’s
race, whether AZT was given immediately, and whether AIDS symptoms
developed during the 3-year study. Let X=AZT treatment, Z=race, and
Y=whether AIDS symptoms developed (1=yes, 0=no). Let x=1 for those who took AZT
immediately and x=0 otherwise, and let z=1 for whites and z=0 for blacks.
> table.4.4<-expand.grid(AZT=factor(c("Yes","No"),levels=c("No","Yes")),
Race=factor(c("White","Black"),levels=c("Black","White")))
> table.4.4<-data.frame(table.4.4,Yes=c(14,32,11,12), No=c(93,81,52,43))
> options(contrasts=c("contr.treatment","contr.poly"))
> summary(fit<-glm(cbind(Yes,No)~AZT+Race,family=binomial, data=table.4.4))
where the term 𝛽iX in this model is a simple way of representing
𝛽1x1 + 𝛽2x2 + ⋯ + 𝛽I−1xI−1.
The Ith category does not need an indicator variable, because we know an
observation is in category I when x1 = x2 = ⋯ = xI−1 = 0.
That is, if all the differences between these coefficients are 0, then the
odds ratio between any two categories is exp(0) = 1.
k = 2:
  n112  n122
  n212  n222
……
k = K:
  n11K  n12K
  n21K  n22K
There are three main statistical analyses for a 2×2×K three-way X-Y-Z table: the
Cochran-Mantel-Haenszel test of X-Y conditional independence, the Mantel-Haenszel
estimate of a common odds ratio, and the Breslow-Day test of homogeneous
association.
This test conditions on the row totals and the column totals in each partial table.
Then, as in Fisher’s exact test, the count in the first row and first column in a partial
table determines all the other counts in that table.
This conditioning results in a hypergeometric distribution for the count n11k .
E(X) = mr/n,   Var(X) = mr(n − r)(n − m) / (n²(n − 1))
For the kth partial table:

  n11k  n12k  n1+k
  n21k  n22k  n2+k
  n+1k  n+2k  n++k
Let X be the count n11k; then X follows a hypergeometric distribution with
parameters n = n++k, r = n1+k, m = n+1k,
i.e. X ~ Hypergeometric(n++k, n1+k, n+1k).
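As a quick numeric check, take the white stratum of the AZT table above, where n = 220 subjects, the AZT row total is r = 107, and the "symptoms yes" column total is m = 46 (counts read from table.4.4). A sketch:

```r
# Hypergeometric mean and variance of n11k for the white stratum
n <- 220; r <- 107; m <- 46
EX   <- m * r / n                                   # expected count, about 22.4
VarX <- m * r * (n - r) * (n - m) / (n^2 * (n - 1)) # about 9.1
c(EX, VarX)
```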
This estimate gives more weight to partial tables with larger sizes.
In AZT use example, the calculated Mantel-Haenszel estimate is 0.4868456.
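The Mantel-Haenszel analysis can be run directly with stats::mantelhaen.test. A sketch using the counts from table.4.4; the dimension order (AZT × symptoms × race) is an assumption of this sketch:

```r
# 2x2x2 array: one 2x2 AZT-by-symptoms table per race stratum
AIDS <- array(c(14, 32, 93, 81,   # whites
                11, 12, 52, 43),  # blacks
              dim = c(2, 2, 2),
              dimnames = list(AZT = c("Yes", "No"),
                              Symptoms = c("Yes", "No"),
                              Race = c("White", "Black")))
mantelhaen.test(AIDS, correct = FALSE)
# The common odds ratio estimate is about 0.49, matching the
# Mantel-Haenszel estimate quoted above to two decimal places.
```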
Consider a logit model for a 2×2×K table.
In this model,
(1) Conditional independence between X and Y corresponds to 𝛽 = 0.
(2) Exp(β) is the common XY odds ratio for each of the K partial tables for categories
of Z.
Breslow-Day test:
A large sample chi-squared test for testing homogeneous association
i.e. testing the null hypothesis θ(1) = θ(2) = ⋯ = θ(K) against the alternative
hypothesis θ(k1) ≠ θ(k2) for some k1 ≠ k2.
Since the odds ratio between X and Y, exp(𝛽), does not depend on Z's level k,
this model can be called the homogeneous association model. For I = 2, if we want
to test the X-Y conditional independence using the model approach, we simply test
whether 𝛽1 − 𝛽2 is zero or not.
When an interaction between X and Z is added, the odds ratio between X and Y
depends on the value of k, so the odds ratio in each partial table is not
necessarily the same.
In fact, for a 2×2×K table, such an inhomogeneous model is the saturated model.
Multiple Logistic Regression
Example: horseshoe crab data with Color and Width Predictors
Continue the analysis of the horseshoe crab data by using both the female crab’s
shell width and color as predictors, where color of collected crabs has only four
categories:
medium light(2), medium(3), medium dark(4), dark(5).
Treat color as a nominal variable and use three indicator variables for the four
categories. Then the model is:
> crab<-read.table(file.choose(),header=TRUE)
> crab$weight<-crab$weight/1000
> crab$psat<-crab$sat>0
> # options(contrasts=c("contr.treatment", "contr.poly"))
# (above command is default setting, not necessary)
>
crab$c.fa<-factor(crab$c,levels=c("5","4","3","2"),labels=c("dark","med-dark","med",
"med-light"))
> crab.fit.logit <- glm(psat ~ c.fa+width, family = binomial, data = crab)
> summary(crab.fit.logit)
R output:
When crab color is dark, c1 = c2 = c3 = 0, and the prediction equation is:
We observe that the coefficient of width is the same for every color: regardless of color, a
1-cm increase in width has a multiplicative effect of exp(0.468) = 1.60 on the odds of
having a satellite.
Also, at any given width, the estimated odds that a medium crab has a satellite are
exp(1.4023 – 1.1061) = 1.34 times the estimated odds for a medium-dark crab, i.e.
Odds(satellite | color = med) / Odds(satellite | color = med-dark)
= exp(1.4023 − 1.1061) = 1.34
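The arithmetic behind this odds ratio, using the coefficient estimates 1.4023 (med) and 1.1061 (med-dark) from the summary output:

```r
# Odds ratio comparing medium to medium-dark crabs at any fixed width
exp(1.4023 - 1.1061)  # about 1.34
```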
Below Figure displays the fitted model. Any one curve is any other curve shifted to
the right or to the left. Since the model assumes a lack of interaction between color
and width, any two curves in the figure never cross.
To conduct the likelihood ratio test, use “anova”: (where test = "Chisq" is default)
> anova(crab.fit.logit, test = "Chisq")
The R output:
Here the LR test statistic for H0: 𝛽4 = 0 is 24.604 with P = 7.0e-07, which
shows that width is a useful variable in the model. (To conduct the LR test for
H0: 𝛽1 = 𝛽2 = 𝛽3 = 0 in R, we have to switch the order of color and width when
writing the model.)
Multiple Logistic Regression
Example: horseshoe crab data with interaction
To test whether the width effect changes at each color, we can test the significance
of a color by width interaction by fitting a new model with the addition of this
interaction and comparing the model deviance with that of the previous fit.
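A sketch of this comparison in R, assuming the crab data frame and the main-effects fit crab.fit.logit from earlier:

```r
# Fit the color-by-width interaction model and compare deviances with the
# main-effects fit via a likelihood-ratio test
crab.fit.ia <- glm(psat ~ c.fa * width, family = binomial, data = crab)
anova(crab.fit.logit, crab.fit.ia, test = "Chisq")
```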
The R output:
The likelihood-ratio test gives a p-value of 0.2236, which implies that an
interaction model is not necessary here.
Fitting the interaction model is equivalent to fitting the logistic regression model with
width as the predictor separately for the crabs of each color.
Multiple Logistic Regression with Ordinal Predictor
Example: horseshoe crab data
The author Agresti uses scores c = {1, 2, 3, 4} for the color categories and fits
the model; we take an easier route and use the existing scores {2, 3, 4, 5}.
The R output:
Since we assign different scores to the color categories, the reported
coefficients are not the same as in the textbook; the fitted model here is:
To test the hypothesis that the quantitative color model is adequate given that the
qualitative color model holds, we use command “anova”:
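A sketch of this nested comparison, assuming the crab data frame (whose column c holds the numeric scores 2-5) and the nominal-color fit crab.fit.logit from before:

```r
# Quantitative-color model: color enters as a single numeric score
crab.fit.quant <- glm(psat ~ c + width, family = binomial, data = crab)
# LR test of the quantitative model against the nominal (qualitative) model
anova(crab.fit.quant, crab.fit.logit, test = "Chisq")
```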
The R output:
Judging by the p-value, we can go with the simpler (fewer parameters) quantitative
color model.
Section 4.5.1 explains how to present the interpretation of the logistic regression to
laymen by using probabilities.
Let’s revisit the Horseshoe Crab data and try to build a model with all the predictors.
Figure below shows portion of the raw data.
Before building the model, we need to do some preprocessing of the raw data:
> crab<-read.table(file.choose(),header=TRUE)
> crab$weight<-crab$weight/1000
> crab$psat<-crab$sat>0
> crab$c.fa<-factor(crab$c, levels=c("5","4","3","2"))
> crab$s.fa<-factor(crab$s, levels=c("3","2","1"))
the R output:
We can test the usefulness of above model by Likelihood Ratio Test.
This shows extremely strong evidence that at least one predictor has an effect.
From the output we can see that the variables width and weight are highly
correlated, and so they should not be simultaneously included in the model.
In the forthcoming modeling we will remove the variable “weight”.
Model Selection for Horseshoe Crab Data
Other criteria besides significance tests can help select a good model.
The best known is the Akaike information criterion (AIC).
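In R, a fit's AIC is reported in the summary output and by the AIC() function, and step() automates an AIC-driven search over nested models. A sketch assuming the crab.fit.logit object from earlier:

```r
# Smaller AIC indicates a better fit-complexity trade-off
AIC(crab.fit.logit)
# step() drops or adds terms to minimize AIC
step(crab.fit.logit)
```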
A classification table cross-classifies the observed outcome y with the
prediction ŷ, where ŷ = 1 if 𝜋̂ > 𝜋0 and ŷ = 0 otherwise.
For example, in the horseshoe crab data, of the 173 crabs, 111 had a satellite,
for a sample proportion of 0.642. Below Table shows classification tables using π0 =
0.50 and π0 = 0.642.
P(correct classification)=(74+42)/173=0.671
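The classification table and the proportion correct can be reproduced as follows; a sketch assuming crab.fit.logit, with the cutoff 0.642 being the sample proportion of crabs with a satellite:

```r
# Cross-classify observed outcome with prediction at cutoff pi0 = 0.642
pi.hat <- fitted(crab.fit.logit)
tab <- table(observed = crab$psat, predicted = pi.hat > 0.642)
tab
sum(diag(tab)) / sum(tab)  # proportion correct, e.g. (74 + 42)/173 = 0.671
```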
Below Figure shows SAS output of the ROC curve for the model for the horseshoe
crabs using width and color as predictors.
“Nevertheless, R is useful for comparing fits of different models for the same data.”
Model Checking
As we have mentioned before, the deviance of a model is the likelihood ratio test
statistic, testing whether our model is as good as the saturated model.
The R command
anova(model1.fit,model2.fit, test="Chisq")
is very useful in comparing the goodness of fit between model1 and model 2. If the
output shows non-significant difference in model fitting, we usually choose the
simpler model.
Goodness of Fit for categorical predictors
When the predictors are solely categorical, the data are summarized by counts in
a contingency table.
Then we can use the Pearson statistic X2(M) and the likelihood-ratio chi-squared
statistic G2(M) to test the model fit:
For a fixed number of settings, when the fitted counts are all at least
about 5, X2(M) and G2(M) have approximate chi-squared null
distributions.
However, when at least one explanatory variable is continuous, it is
difficult to analyze lack of fit without some type of grouping.
Let x = 1 for those who took AZT immediately and x = 0 otherwise, and
let z = 1 for whites and z = 0 for blacks. The ML fit is
No matter whether a model provides a good fit or not, it is always sensible to see if
there are any potential problematic data. One indicator is the residuals, which may
tell you whether a datum is an outlier. We also want to know if there are any
influential data.
With categorical predictors, we can use residuals to compare observed and fitted
counts.
When the model holds, {ei } has an approximate expected value of zero but a smaller
variance than a standard normal variate.
i.e.
The standardized residual is approximately standard normal when the model holds,
so we prefer it. An absolute value larger than roughly 2 or 3 provides evidence
of lack of fit.
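In R, standardized Pearson residuals come from rstandard(). A minimal sketch, where fit is a placeholder name for whichever glm object was fitted above:

```r
# Standardized Pearson residuals for a fitted glm object called "fit"
std.res <- rstandard(fit, type = "pearson")
which(abs(std.res) > 2)  # flag observations suggesting lack of fit
```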
We should use residuals cautiously when fitted values are very small, since the
normal approximation does not hold there.
Data input:
> yes<-c(32,21,6,3,12,34,3,4,52,5,8,6,35,30,9,11,6,15,17,4,9,21,26,25,21,7,25,31,3,9,
10,25, 25,39,2,4,3,0,29,6,16,7,23,36,4,10)
> no<-c(81,41,0,8,43,110,1,0,149,10,7,12,100,112,1,11,3,6,0,1,9,19,7,16,10,8,18,37,0,
6,11,53,34,49,123,41,3,2,13,3,33,17,9,14,62,54)
> table.5.5<-cbind(expand.grid(gender=c("female","male"),
dept=c("anth","astr","chem","clas","comm","comp","engl","geog","geol","germ",
"hist","lati","ling","math","phil","phys","poli","psyc","reli","roma","soci",
"stat","zool")),prop=yes/(yes+no))
We may also obtain G2 = 44.7 and X2 = 40.9, both with df = 23 (P-values = 0.004
and 0.012).
So far I haven’t found an R function that reports G2 and X2 directly, but it is
not difficult to obtain the predicted values from “predict” and then calculate
G2 and X2 without any more difficulty.
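In fact both statistics can be read directly off a grouped-data glm fit, so the detour through predict is optional. A sketch, assuming the yes/no count vectors and table.5.5 defined above, and assuming the fitted model is the department-only model (which has 46 − 23 = 23 residual df, matching the df = 23 quoted above):

```r
# Grouped-data binomial fit: the residual deviance is exactly G2
fit <- glm(prop ~ dept, weights = yes + no,
           family = binomial, data = table.5.5)
G2 <- deviance(fit)                            # about 44.7 on 23 df
X2 <- sum(residuals(fit, type = "pearson")^2)  # about 40.9
c(G2 = G2, X2 = X2)
```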