(Reference book for old course: Logistic Regression Models, Joseph M. Hilbe, 2009)
(Reference book for new course from Sept 2015: Practical Guide to Logistic Regression, Joseph M. Hilbe, 2015)
Note: These discussions have been selected from prior courses and lightly edited to correct typos
and remove student names. A useful way to use this resource is by browsing, or via the index.
WEEK 1:
        X0    X1
      -----------
Y1      A     B
Y0      C     D
      -----------
       A+C   B+D
Odds refer to p/(1-p), where p is a probability. An odds ratio is the ratio of two such odds, with the numerator odds computed at level 1 and the denominator odds computed at level 0, where 1 and 0 are the two levels being compared.
With respect to the table above, an OR can be characterized as (I'm using 2 for an example)
The fitted value is calculated using the logit inverse link function, 1/(1+exp(-xb)).
We have
xb = 1 : 1/(1+exp(-1)) = .73105858
xb = .5: 1/(1+exp(-.5)) = .62245933
Both fitted values are probabilities, which have a linear relationship with the predictors.
But I am not sure about the way he calculates it (reference Logistic Regression Models, Joseph M. Hilbe, 2009).
By exponentiating the coefficient of white, we get the odds ratio. Now let's look at what the output is for a table of odds ratios. Use of eform with the glm command provides exponentiated coefficients, which for logit models are the odds ratios. The nohead option tells Stata not to show the heading statistics, and nolog suppresses the iteration log.
Yep, .277287 is what we find. We do not get this value directly from the V-C matrix, but the first
component of the product generating the statistic does. Confidence intervals for Odds Ratios
are, however, calculated by exponentiating the respective lower and upper CIs of the coefficient. Thus, for white,
. di exp(-.0990914)
.90565993
. di exp(.7041167)
2.0220598
Model coefficients and CIs are the base statistics, and are used for other related calculated
statistics.
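In R, a minimal sketch of the same idea might look like this (the model object name, formula, and data below are placeholders of mine, not taken from the course output):
# Hedged sketch: odds ratios and their CIs from a fitted logistic model
mymodel <- glm(died ~ white + los, family = binomial, data = mydata)
exp(coef(mymodel))             # exponentiated coefficients = odds ratios
exp(confint.default(mymodel))  # exponentiate the Wald CI limits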
WEEK 2:
So, b_0 will be log(0.25)=-1.386 and b_1 will be log(4)=1.386. You can try it out
glm(y~1+x, family=binomial)
OK, now let's get the predicted probabilities. The first 50 observations all have x=0, and so the predicted probability will be 1/(1+exp(-b_0)) = 1/(1+exp(1.386)) = 0.2.
The second 50 observations all have x=1, and so the predicted probability will be 1/(1+exp(-(b_0+b_1))) = 1/(1+exp(0)) = 0.5.
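A small R sketch (my own construction, not the course data) that reproduces this setup exactly, assuming 10 successes in the first group and 25 in the second:
# 10/50 successes when x=0 gives odds 0.25; 25/50 when x=1 gives odds 1,
# so b_0 = log(0.25) and b_1 = log(4), with fitted probabilities 0.2 and 0.5.
x <- c(rep(0, 50), rep(1, 50))
y <- c(rep(1, 10), rep(0, 40), rep(1, 25), rep(0, 25))
coef(glm(y ~ 1 + x, family = binomial))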
Robust standard errors are objective. In case you didn't know, there are different kinds of
robust standard errors. Each one addresses one or more assumptions of the model, and then
specifically protects your inference against violations of those specific assumptions. Furthermore,
they protect inference from any form of the violation. So long as the conditional means are
meaningful, the inference will be correct. If the model was correct and we really didn't need to
make any correction via robust standard errors, the inference is still correct (estimates are still
consistent) using robust standard errors. The coefficient estimates are just not quite as
efficient. Using robust standard errors is like saying, "I know that the log-likelihood is not
Note how in both cases, we basically said, "Hey, world! My model is wrong." If all we do
afterward is discuss the properties of the regression coefficients, that is OK. However, we really
shouldn't carry on in an analysis where we utilize likelihood ratio tests, or calculate residuals
based on likelihood arguments. After all, we already admitted the likelihood is wrong, so we
really shouldn't try to have it both ways. Stata, for example, won't calculate likelihood-based
statistics after you estimate a model using the sandwich (robust) variance estimate. It will,
however, calculate them after using scaled standard errors, which isn't really consistent
behavior. Most other packages will calculate whatever you ask for, so this advice from me is a
bit heavy-handed. That said, I do think it wise to consider that using these alternate variance
estimates is a declaration of the model being wrong, and so your toolbox should be limited
afterward. If you really think your model is wrong, then perhaps you should spend a bit of time
trying to find a model that isn't wrong.
Consistent estimates are those that converge to the true parameter value as the sample size
goes to infinity. For example, if I have a statistic J and it is estimating a parameter value P, and
the expected value of J is (P + 1/n) where n=sample size, then you can see as the sample size
goes to infinity, the expected value of J is equal to P. Thus, J consistently estimates P.
Efficiency refers to the relative variance of certain estimators. If two estimates are both
consistent, but one has smaller variance than the other as the sample size increases, then the
estimate with the smaller variance is more efficient. For example, for a sample from a normal distribution with variance 1, the sample mean has variance 1/n and the sample median has variance pi/(2*n),
which is approximately equal to 1.57/n. The sample mean is more efficient than the sample
median when it comes to estimating the population mean.
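A quick simulation sketch in R (my own illustration) makes the efficiency comparison concrete:
# Compare the variance of the sample mean and sample median over repeated
# samples from a standard normal distribution.
set.seed(1)
n <- 100
sims <- replicate(10000, {x <- rnorm(n); c(mean(x), median(x))})
apply(sims, 1, var)   # roughly 1/n = 0.010 for the mean, pi/(2*n) = 0.016 for the median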
When I say that inference is correct, I mean that when I construct a 95% CI, then the
construction of that interval really would contain the true parameter 95% of the time if we were
to carry out the same experiment repeatedly. I mean that a test of size alpha=0.05 really does
have Type I error equal to 5%.
After I fit the model to these two different value types, the coefficients are different. I do not
include the results here.
I know I can complete Question #3 without ordering 'relig.' However, I am curious what ordering
means. I am guessing that for multinomial regression ordering the dependent variable would
mean something.
Instructor’s Response:
Given that I don't often use R to do my data analysis, I was not aware of the issues that I am
about to address here.
Software packages try and help when they can, but it is our responsibility to know what is
happening. This is especially true when the software makes decisions on our behalf. When I
lecture to a live audience about this, I repeatedly emphasize that you don't want the software to
do your thinking for you. Because of that, my personal choice when using software is to NEVER
let it make decisions for me. I never tell SAS that my variable is a CLASS variable. I never tell
SPSS or R that my variable is a factor variable (ordered or unordered). I sometimes tell Stata
that I want indicators for my variable by specifying the "i." prefix on the variable name, but I use
Stata all of the time, the specification is explicit, and the specification shows up in the
output. OK - rant over.
Here is what happens in R with your variable. Let's say you have a variable F that has 5
levels. What you really want is to include an indicator for 4 out of the 5 levels (leave one level
out as the reference level). I explicitly create an indicator for each of the 5 levels like this:
F1 <- 1*(F==1)
F2 <- 1*(F==2)
F3 <- 1*(F==3)
F4 <- 1*(F==4)
F5 <- 1*(F==5)
Then, I explicitly leave out the one that I explicitly want to be my reference level. For example,
like this
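(assuming, purely for illustration, that F1 is the level I leave out as the reference):
# Include 4 of the 5 indicators; the omitted F1 becomes the reference level.
glm(y ~ x + z + F2 + F3 + F4 + F5, family = binomial)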
If I tell R that my variable F is an "unordered factor", then when I use F in a specification it will
apply
contr.treatment(5)
2 3 4 5
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1
glm(y ~ x + z + F, family=binomial)
R chooses the smallest numeric level as the reference which may or may not be the one I want,
but at least I understand what is happening.
> contr.poly(5)
.L .Q .C ^4
[1,] -0.6324555 0.5345225 -3.162278e-01 0.1195229
[2,] -0.3162278 -0.2672612 6.324555e-01 -0.4780914
[3,] 0.0000000 -0.5345225 -4.095972e-16 0.7171372
[4,] 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
[5,] 0.6324555 0.5345225 3.162278e-01 0.1195229
As you can see, this is not the simple generation of indicator variables. Rather, it is treating
these levels as ordered and making various contrasts. If I really wanted to do this, I would
explicitly create my own variables. Those are the terms that are automatically generated on my
behalf when F is an "ordered factor" and I estimate a model
glm(y ~ x + z + F, family=binomial)
If I want to explicitly generate the variables for this model, then this is how you do it:
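(a sketch of what that explicit construction could look like, assuming F is coded 1 through 5; the column names below are mine):
# Build the orthogonal polynomial columns by hand and include them explicitly.
P   <- contr.poly(5)   # the 5 x 4 contrast matrix shown above
F.L <- P[F, 1]         # linear term
F.Q <- P[F, 2]         # quadratic term
F.C <- P[F, 3]         # cubic term
F.4 <- P[F, 4]         # quartic term
glm(y ~ x + z + F.L + F.Q + F.C + F.4, family = binomial)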
This gets the same output, but it does so under my explicit guidance. I know exactly where
those factors came from. I (might) even know what they mean. The point is, had I not known
that R was doing the above, I might have thought it was creating indicator variables. Even if I
knew it wasn't doing that, if I hadn't looked through the help files on the default treatment of
ordered and unordered factors, I might not have known that the treatments differ.
OK - now, finally, I am going to answer your questions; sorry for the long aside leading us to
here.
It does not. That is, if you say nothing about the nature of a variable, then R will treat it as a
numeric variable. If you tell R to treat a variable as a "ordered factor" or an "unordered factor",
then you get the above treatment which you may or may not want.
2. In R, there are ordered and unordered factors. Does ordering a factor mean anything for binomial logistic regression? I know it does make a difference to the results,
because I tried it (results not shown here). However, I am not sure how to interpret the
differences.
Please see the answer above. It is rather complicated and leads me to my most important
advice: always explicitly construct the variables that you want in your model and then explicitly
include them. It makes your life so much easier in the end even if it means that you have to
spend tons of time generating variables!
The nonlinear mapping is what bothers me. Could you give me some insight into how things like
confidence intervals for µ on the LHS of our model are related to linear model values on the
RHS? If you could point me to a reference on how the algorithms work, that would be enough. I
know this is kind of an open-ended question, but I am just curious how it all works.
Instructor’s Response:
Here is how I would write the summary:
We want to estimate the conditional mean of an outcome given covariates, and that conditional
mean is always in (0,1), and the outcome is always zero or one. So, we use the Bernoulli
because that distribution has these properties.
Rather than trying to specify a really complicated relationship between covariates (which is
another viable alternative), I will simply transform the linear predictor (X*beta=η) to the probability scale with the inverse link, 1/(1+exp(-η)), so that the fitted mean always lies in (0,1).
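A hedged sketch of how this plays out for confidence intervals in R (the model object name is a placeholder): build the interval on the linear-predictor scale, then map it through the inverse link so it stays in (0,1).
pr <- predict(mymodel, type = "link", se.fit = TRUE)   # eta and its SE
lo <- pr$fit - 1.96 * pr$se.fit
hi <- pr$fit + 1.96 * pr$se.fit
cbind(mu = plogis(pr$fit), lower = plogis(lo), upper = plogis(hi))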
Subject: Questions
Participant’s Post:
Questions (PGLR book)
Page 65: The bootse values appear to be identical to the se values. Looking into the
bootstrapped values of bse$t, these are identical across the 100 iterations, and my
understanding of the code in the function is that I will simply reproduce the bootmod model. Is
there something that should be varied in each iteration of the function?
Page 72-3, LR test and drop1() function. It might seem obvious, but I'm assuming that one should consider dropping predictors where the model with the predictor dropped is not significantly different from the model with the predictor. (It might be worth saying this in the text.)
Page 74: Anscombe residuals: The a and b parameters for the beta distribution are both set to
2/3. Would this always be appropriate?
Page 85: First line. Is the ROC statistic the same as the AUC statistic which is described later in
the page?:
Instructor’s Response:
Page 65: The bootse values appear to be identical to the se values. Looking into the
bootstrapped values of bse$t, these are identical across the 100 iterations, and my
understanding of the code in the function is that I will simply reproduce the bootmod model. Is
there something that should be varied in each iteration of the function?
The code is incorrect on page 64. The bootstrap procedure calls the t() function
repeatedly. Each time it calls the function, it passes the original data as the first argument
(medpar) and it passes the indices that were randomly selected in the second argument. The t()
function correctly grabs the randomly selected rows from the passed argument (x) and places
those into xx, but then refers to medpar instead of xx in the glm() call. Here is the change (note
the very last specification of the data):
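(The book's code isn't reproduced in this thread, so the function below is only a sketch of the corrected version; the model formula is an assumption. The key fix is data = xx rather than data = medpar.)
library(boot)
t <- function(x, i) {
  xx <- x[i, ]                      # the randomly selected rows
  bsglm <- glm(died ~ white + hmo + los, family = binomial, data = xx)
  coef(bsglm)
}
bse <- boot(medpar, t, R = 100)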
Making this correction will cause the bootstrap to honor the randomly selected rows in each
iteration so that there will be randomness. Without this change, it simply estimates the original
model over and over. You correctly diagnosed exactly what was wrong with the program!
Page 72-3, LR test and drop1() function. It might seem obvious, but I'm assuming that one should consider dropping predictors where the model with the predictor dropped is not significantly different from the model with the predictor. (It might be worth saying this in the text.)
Depends on who teaches the class! Oftentimes, classes are taught in which certain rules of
model construction are fairly rigid. One such rule is deleting variables that are not
significant. Modern analysts don't do that. Variables that are theoretically supposed to be
there, or are historically there, are left in the model no matter what.
Page 74: Anscombe residuals: The a and b parameters for the beta distribution are both set to
2/3. Would this always be appropriate?
The formula for the Anscombe residuals is usually written in terms of the Hypergeometric2F1
function. It is a fairly complicated-looking result that follows from the definition of the residuals. The definition was motivated by arguments that such residuals should more closely follow the normal
distribution. In any event, the formula given in the book is in terms of the beta function and the
incomplete beta function, and not the beta distribution. It is very rare that someone actually
uses the Anscombe residuals.
Page 85: First line. Is the ROC statistic the same as the AUC statistic which is described later in
the page?:
Yes, exactly! The ROC is the receiver-operator characteristic curve and the area under that
curve (AUC) is the principal statistic. Various people refer to it using either ROC statistic or AUC
statistic.
when models differing by a single predictor are being compared. The LR test is basically the
same as the AIC when comparing between two models. AIC compares non-nested models as
well, for which the LR test should not be used.
Subject: P-values
Participant’s Question:
Line starting...
"Probability value (p-value) criteria for keeping predictors...range from 0.1 to 0.5...
"... We shall retain predictors having parameter p-values 0.25 or less."
Is this correct or should it be "0.01 to 0.05" and "0.025"?
Instructor’s Response:
In that section of the book we are discussing model construction. We are describing what we can do to determine possible predictors for inclusion in a final model. The p-value refers to using ulogit, as I recall. At this stage you want to be soft in allowing variables to be retained. The reason is that they may end up in an interaction. Moreover, when a predictor is in a model with other predictors, the p-value may change substantially from when it’s in only a univariate relationship with the response. But I have found that 0.25 is a good compromise, in that most variables over that criterion will not significantly contribute to the model even with other predictors. Some authors suggest an even higher criterion.
Notice that here hcabgCABG is just significant (zero isn't in the confidence interval). Notice also that the confidence interval isn't symmetric if we look at the distance of the interval endpoints from the point estimate:
abs(confint(heart01.ahk2.glm, 3) - coef(heart01.ahk2.glm)[3])
0.7499889 0.6465227
and just like with the bootstrap confidence interval, the lower end of the interval is farther from
the point estimate than the upper end (the distribution is left skewed).
Any pointers to better understand what profile likelihood confidence interval
estimates mean would be greatly helpful.
Instructor’s Response:
The profile likelihood based confidence intervals (CIs) are indeed different from the standard
way we calculate CIs: coef +/- 1.96*SE. Profile CIs are based on the profile likelihood, which is
calculated over a range of scaled likelihood values.
Using the standard method for calculating CIs, the 95% CIs for hcabg are .0952627 , 1.478059.
Compare these values with what you showed when using confint(): 0.03642966 1.4329412. Here
is a user written command in Stata for a profile likelihood and CI, called logprof. I tried it on
hcabg, and got the following results, plus a nice graphic. Do the values seem familiar?
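In R, the two kinds of intervals can be compared directly; a minimal sketch (the model object name is a placeholder):
confint.default(mymodel)   # Wald intervals, coef +/- 1.96*SE
confint(mymodel)           # profile-likelihood intervals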
WEEK 3:
a0 a0+a1 a0+a2
b0+b1 b0 b0+b2
c0+c1 c0+c2 c0
So, the results can be different. Even though the models don't directly perform the final
comparison, you can get those comparisons by specifying a post-hoc Wald test. Once you add in
the final comparison with a post-hoc analysis, then all models report identical results.
In the above, if you want to turn things into odds ratios, then just apply the logit transform of
the linear predictor where you set coefficients to one or zero for the specific variable levels you
want.
P(Y1=y1 and Y2=y2 and Y3=y3 and Y4=y4) = P(Y1=y1) * P(Y2=y2) * P(Y3=y3) *
P(Y4=y4)
because the four observations are independent (Y1, Y2, Y3, and Y4 all have the same Bernoulli
distribution - a distribution with the same value of p). Without the assumption of
independence, the joint distribution would have to consider covariance among the four
observations, so this is a really powerful assumption. The joint distribution can be written:
The maximum likelihood estimate of p is that value of p that maximizes this joint
probability. That is, we treat the data (y1,y2,y3,y4) as known and treat the parameter as
unknown. Instead of calling it the joint probability, we call it the likelihood - we change its
name to emphasize that we have changed what we assume is known and unknown.
This is a really easy problem, so we could just draw a picture: Imagine that (y1,y2,y3,y4) =
(1,0,1,0). In that case, what is the MLE of p? Let's just assume that we have no idea, but we do
know that for the Bernoulli distribution, 0<p<1. So, let's just draw a picture for these values.
In R:
y <- c(1, 0, 1, 0)              # the observed data
p <- seq(0.01, 0.99, len = 99)  # a grid of candidate values for p
# Bernoulli likelihood: the product of (1-p)^(1-y) * p^y over the four observations
likelihood <- ((1-p)^(1-y[1]) * p^y[1]) * ((1-p)^(1-y[2]) * p^y[2]) *
  ((1-p)^(1-y[3]) * p^y[3]) * ((1-p)^(1-y[4]) * p^y[4])
plot(p, likelihood)             # the likelihood peaks at p = 0.5
What is the array likelihood? It is the probability that (y1,y2,y3,y4)=(1,0,1,0) for p=0.01, p=0.02, ..., p=0.99. There is a best value for p in the sense that it makes the joint probability as big as possible.
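We can read the maximizer straight off the grid (a one-line check, not in the original code):
p[which.max(likelihood)]   # 0.5, the grid value with the largest likelihood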
Is there a way that we could find the MLE of p? Something more formal than simply calculating
the likelihood for every conceivable value and looking at the picture? The answer is yes. In
mathematics, this is called a maximization problem or an optimization problem. Now, the
maximum of a function is also the maximum of the log of that function. Not too sure of that?
Generally speaking, it is easier to deal with the log of the likelihood because the log of a product
is the sum of the logs: that is, log(a*b) = log(a) + log(b). That means that the log-likelihood from
the above likelihood function can be written:
Now comes more math. To find the maximum of this function, we know that the function is
concave; that is, there is a maximum. On either side of that maximum, the function
decreases. If we were to draw a tangent line at each point on that function, the tangent line
would be horizontal (have slope=0) at the maximum and that is the only point at which the
tangent line would look like that. Equivalently, that means that the derivative of the function is
equal to zero at that point. So, whatever value of p makes the derivative of the function equal to zero is the value that maximizes the likelihood (and the log-likelihood).
Setting the derivative of the log-likelihood, (y1+y2+y3+y4)/p - (4-(y1+y2+y3+y4))/(1-p), equal to zero gives
(y1+y2+y3+y4) = 4p
(y1+y2+y3+y4)/4 = p
For comparison, the likelihood itself (without taking the log) is
(p)^(y1+y2+y3+y4) * (1-p)^(4-y1-y2-y3-y4)
= (1-p)^2 * p^2 for (y1,y2,y3,y4) = (1,0,1,0)
Derivative:
dL/dp = 2p - 6p^2 + 4p^3
Setting this to zero has a trivial solution at p=0, which is not interesting since we need solutions 0 < p < 1. Dividing by p leaves the quadratic 4p^2 - 6p + 2 = 0, so
p = (6 +- 2)/8
p = 1/2 and p = 1
p=1 is not interesting since we need 0 < p < 1. Thus, the only interesting solution is p = 0.50
To emphasize, we did not have to take the log of the likelihood. We could have taken the
derivative of the product, set it to zero, and solved. However, the function is much more
difficult to work with. So, we don't take the log out of old-fashioned habit. We do it because it
makes these problems MUCH easier.
No, the data are what they are. They have conditional variance. That variance needs to be
matched by your chosen model. Now, it certainly may be true that data from one source may
have conditional variance that is lower than the conditional variance of data from another
source. So, I don't like these statements because they sound like "my model is good, but
sometimes data are bad which leads me to having to change my good model".
The real statement is "I have a standard first model that I always consider. For the most part all
data that I receive are adequately modeled by this first choice. On occasion, the data are
actually over dispersed relative to my model. On occasion, the data are actually under dispersed relative to my model."
However, I occasionally see vendors who lose control of their processes and end up screening
their components for shipment to their customers. Generally, component parametrics are well-
modeled using a normal distribution. However, when vendors screen their parts they remove
the tails of the distribution. Would this also be considered an underdispersed situation?
OK - so, this is a really great question because it asks about the subject matter of
over/under/equi-dispersion in a continuous outcome model as opposed to a discrete outcome
model. The answer (for the most part) is that you don't have to worry. At least, you don't have
to worry as much as you do when you have a discrete outcome. The reason is that in
continuous distribution models, the variance has an additional parameter that allows much
more flexibility to estimate the conditional variance. So, the problem just doesn't arise. Now, in
the particular case that you brought up, what if the assumption of a continuous outcome
modeling outcomes on an infinite (or, at least, very wide range) just isn't really true? Shouldn't
we consider truncated versions of those distributions as we do when we consider truncated
distributions for count outcomes in restricted ranges? The short answer is: sure - even if it
doesn't gain us all that much efficiency. There are continuous outcome models based on
truncated distributions. Some specific ones are tobit regression and truncated
regression. While they aren't called over/under dispersion situations, they do explicitly treat the
limited range of the outcomes.
Can you give me a simple way to understand why the model SEs need to be adjusted?
Instructor’s Response:
Let's list out some facts and then show how these various choices of standard errors relate to
those facts.
(1) If we assume that the outcomes follow a distribution (Bernoulli, Binomial, Poisson, etc), then
the model-based standard errors are a consequence of that distribution. That is, the standard
errors are driven by the variance of the assumed distribution.
(2) If the data really do come from the assumed distribution, then the standard errors from the
distribution will be "correct".
(3) If the conditional variance of the data closely matches the conditional variance prescribed by
the assumed distribution, then the Pearson dispersion statistic will be about 1. Because the
conditional variance of the data is the same as the conditional variance prescribed from the
distribution, the model-based standard errors from the assumed distribution are "correct" - the
right size to ensure that statistics based on those standard errors will have their prescribed
properties.
(4) If the conditional variance of the data is less than the prescribed conditional variance, then
the data are underdispersed relative to the assumed distribution and the Pearson statistic will
be less than 1. Because the conditional variance of the data is smaller than the conditional
variance prescribed from the distribution, the model-based standard errors from the assumed
distribution are too large. If we multiply the model-based standard errors by the square root of the Pearson dispersion statistic, they are probably closer to the correct size (though we don't have any compelling
theoretical justification, and we really don't know how small the Pearson statistic should be to
compel us to make this adjustment).
(5) If the conditional variance of the data is greater than the prescribed conditional variance,
then the data are overdispersed relative to the assumed distribution and the Pearson statistic
will be greater than 1. Because the conditional variance of the data is larger than the
conditional variance prescribed from the distribution, the model-based standard errors from the
assumed distribution are too small. If we multiply the model-based standard errors by the
Pearson statistic, they are probably closer to the correct size (though we don't have any
compelling theoretical justification, and we really don't know how small the Pearson statistic
should be to compel us to make this adjustment).
(6) If we believe that 4 or 5 is the case, then in a very real sense, we are saying that the assumed
distribution is the wrong one. Not for the mean, but for the variance. When we say that, we
acknowledge that the regression coefficient estimates are correct, but the incorrect standard
errors will lead to incorrect inference.
(7) If we decide that there just isn't any way to figure out the correct distribution, we can derive
an alternate standard error that is "distribution-agnostic". The sandwich variance estimate
combines the model-based variance estimator with a non-parametric
variance estimator. Because of this, it more closely resembles the variance of the observed data
when the data do not conform to the assumed distribution.
(8) If we use (7), then in a very real sense, we are saying that the assumed distribution is the
wrong one.
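A hedged sketch of what the sandwich estimate in (7) looks like in practice in R, using the sandwich package (the model object name is a placeholder):
library(sandwich)
se_model  <- sqrt(diag(vcov(mymodel)))                 # model-based SEs
se_robust <- sqrt(diag(vcovHC(mymodel, type = "HC0"))) # sandwich (robust) SEs
cbind(model = se_model, robust = se_robust)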
Can you give me a simple way to understand why the model SEs need to be adjusted?
We use standard errors for things like building 95% confidence intervals. 95% confidence
intervals have the property that if experiments are carried out over and over, then 95% of those
intervals will contain the true parameter value. If the standard errors are wrong, then maybe
99% of the intervals will contain the true parameter, or maybe 73% will contain the true
parameter. A standard error is also used to create statistics for hypothesis testing. If the standard error is correct, then the test really has the level of significance that we claim. If the standard error is wrong, then the true level of significance might be smaller or larger than we claim.
We adjust SEs to ensure that the statistics we compute have the properties that we say.
Does R automatically perform the adjustment when used with grouped data (i.e. when it is given
the cases command)? If not, what actions do I need to take to force the adjustment to occur?
No. That a model is binomial instead of Bernoulli is irrelevant. Data can be over dispersed
relative to the Bernoulli just like they can be over dispersed relative to the binomial. So, there is
no distribution-specific reason to automatically scale. You can decide to scale if the Pearson
statistic is bigger than 2 or smaller than one half (those are common rules of thumb). Or, you
can decide to use robust standard errors. Or, you can decide that the evidence is not compelling
enough to use alternate standard errors.
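A hedged sketch of how to compute the Pearson dispersion statistic for a fitted glm in R (the model object name is a placeholder):
disp <- sum(residuals(mymodel, type = "pearson")^2) / df.residual(mymodel)
disp   # near 1 suggests equidispersion; > 2 or < 0.5 are the rules of thumb above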
Perhaps. I am of the opinion that such things should be considered in context. If you want, you
can run your analyses under a rule of thumb about alternate standard errors always being
applied when the Pearson statistic is bigger than 2 (for example). In those cases, you must
then never rely on any likelihood-based statistics (AIC, LR tests, etc) because, after all, you have
"declared in a sense that the model is wrong". So, you see, applying the standard errors is a
real knee-jerk reaction to fixing a problem, but it has a cost. There are no free lunches!
Now, all of the above discussion on my part is on the definition of cut points as are used to
decide what predicted level is described by a model. Another definition of cut points is in how
best to convert an M-outcome level (or continuous) variable into a K-level variable where
K<M. For example, you might ask people what level of education they have, and get answers
1=1st grade, 2=2nd grade, ..., 12=graduated high school, 13=college, 14=post-graduate. You may want to turn this 14-level variable into a 3-level variable by defining, let's say, 1-8 (less than high school), and so on.
exp(.7703072*1+.0100189*1+(-.0476866)*1*1+(-1.048343))
= 0.7292756
so the probability of death given white=1 and los=1 is
= Odds/(1+Odds)
= 0.7292756/(1+0.7292756)
= 0.4217232
So I guess my question is how to interpret the probabilities at the bottom of page 214.
Instructor’s Response:
Usually a researcher is only interested in calculating odds ratios for coefficients, where an odds
ratio is the ratio of the odds of one level of the predictor compared to the odds of the reference
level of the predictor. Therefore, if die is the response and gender (1-male;0-female) is the
binary (1/0) risk factor, or explanatory predictor, and the odds ratio given for gender is 2, then
the odds of death for males compared to females is 2, or males have twice the odds of death
than females, etc. Odds ratios in this respect deal with the model as a whole, and represent the
mean for all observations.
The second statistic that is normally of interest to researchers is the probability of y==1. For our
example above, we would be interested in the probability of death for each individual in the
data. Whereas parameter estimate odds ratios are with respect to the model as a whole (the
mean for all individuals), the probability is with respect to each individual. Actually, probabilities are calculated for covariate patterns, since the observations within a pattern share identical predictor values. We convert the linear
predictor, xb, into the probability of y==1 by calculating for each individual, 1/(1+exp(-xb)) or
exp(xb)/(1+exp(xb)).
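In R, either form of the conversion gives the same fitted probabilities (a small sketch; the model object name is a placeholder):
xb <- predict(mymodel, type = "link")            # the linear predictor
p1 <- 1/(1 + exp(-xb))                           # same as exp(xb)/(1 + exp(xb))
all.equal(unname(p1), unname(fitted(mymodel)))   # TRUE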
However, we can also convert the individual probabilities to the odds of an individual dying, for
example, by using the formula given in the book. This is rarely used in research, but it has been.
Given our knowledge that the odds of x are defined as p/(1-p), where p (probability) is from 0 to
1, the odds of x==1 is calculated as
odds1 = p1/(1-p1)
and the odds of x==0 is
odds0=p0/(1-p0)
We get the same 1.353255 value for the odds ratio as displayed in the model above. You may
convert back to the predicted probabilities for each individual by using the following code.
prob1=mu1 and prob0=mu0.
. gen prob1=odds1/(1+odds1)
. gen prob0 = odds0/(1+odds0)
I hope that this clarifies what is happening. I don't believe you'll find this breakdown of the
relationship of odds, OR, probability, and observation odds elsewhere, but it does all fit
together remarkably well.
MODEL          DEVIANCE    DIFFERENCE   DF   MEAN_DIFFERENCE
=============================================================
intercept      1486.290
MAIN EFFECTS
anterior       1457.719      28.500      1        28.500
hcabg          1453.595       4.124      1         4.124
killip         1372.575      81.020      3        27.001
agegrp         1273.652      98.923      3        32.974
=============================================================
The values are different, but the results are clearly the same. If you would like to complete it for
the interaction of anterior and killip levels and hcabg and age levels, it would be interesting to
see the results. I strongly suspect that it will be the same.
Your question had to do with differences in observations between some of the component
models. I looked up the original M&N source, and some recent sources that show this method,
and no one mentions differences in observations due to missing values. However, I think you
make a valid point.
Participant’s Comment:
I've been playing around a little more, and in R, the function anova() given a single model as an
argument generates a similar table.
So these values seem essentially identical to your table. With respect to the book, hcabg is now
significant, the Df for the interaction terms are 3 instead of 4, and it looks like the interaction
term of anterior & killip is significant here (it certainly has a noticeably larger Deviance
difference than the value in the book).
What got me thinking along these lines was an analogy to likelihood ratio tests where the
models have to be nested and use the same set of observations.
Subject: quasibinomial
Participant’s Question:
I noticed that when using the 'quasibinomial' family in the R glm function that the results do not
produce an AIC value. What is the best way to compare models built using the 'binomial' vs.
'quasibinomial'?
I am not confident in how to confirm whether there is an issue with over-dispersion. Can you clarify how to confirm whether over-dispersion is a problem within the model being analyzed?
Instructor’s Response:
The reason is that scaled SEs are not directly derived from the model variance-covariance
matrix. Remember, SEs are the square root of the diagonal terms of the V-C matrix. This is easy
to see when using R to determine the SEs of a model. If I ran a logistic regression in R with the
glm function, naming the model "mymodel", and wanted to create a vector of predictor SEs to
use in some other operation, I would need to use the following code: se <-
sqrt(diag(vcov(mymodel))). Anyhow, the V-C matrix is based on the model log-likelihood
statistic. But scaled SEs are not calculated that way. Yes, we start with the model SEs, but then
multiply them by the square root of the Pearson Chi2 dispersion statistic. This means that you
cannot use the log-likelihood to compare models. And since the AIC and BIC statistics are based
on log-likelihoods, they are not applicable for scaled (or quasi-) models.
To assess if a logistic model or a scaled logistic model (quasibinomial) is preferred, run the two
models and check the values of the SEs. If there has been a change -- especially a substantial
change -- in their values, then the scaled or quasibinomial model is preferred. It has adjusted the
SEs for the extra correlation in the data. Technically, scaling attempts to provide values of SEs
that would be the case if there were no extra correlation in the data. The coefficients (or odds
ratios) are unchanged in the two models. What I have said here also applies to overdispersion or
extra correlation in grouped logistic models. The best way to adjust for overdispersion in
grouped models (indicated by having the dispersion statistic be greater than 1.0) is to run a
beta binomial model. Check the AIC and BIC statistics of both models and choose the model with
the lower AIC/BIC value. For Bernoulli models I suggest that you employ robust or sandwich SEs when modeling the data. Compare the SEs with a logistic model where the model SEs are left unadjusted.
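A hedged sketch of that comparison in R (formula and data names are placeholders); note that the quasibinomial SEs are just the binomial model SEs multiplied by the square root of the Pearson dispersion:
m1 <- glm(y ~ x1 + x2, family = binomial,      data = mydata)
m2 <- glm(y ~ x1 + x2, family = quasibinomial, data = mydata)
disp <- summary(m2)$dispersion                 # Pearson dispersion estimate
cbind(model  = summary(m1)$coef[, 2],
      scaled = summary(m1)$coef[, 2] * sqrt(disp),
      quasi  = summary(m2)$coef[, 2])          # scaled and quasi agree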
Subject: Log-Likelihood
Participant’s Question:
1. I need to understand what the log-likelihood is. I've always seen this number in the analysis output, but what does it actually mean? What is its value telling us, and how do we interpret it?
2. Let's say we have a continuous predictor variable. Over its range of values, how can we group it into categories? What method facilitates this, and is there any rule of thumb?
3. All the examples and analysis procedures in this course are presented in Stata with an epidemiology focus. My background is in engineering, and Minitab is the software I used for this course, so I'm having a hard time following some of the concepts. I'm just curious if you know of or have any resources showing logistic regression applications in the engineering field with Minitab; that would probably help me understand more.
I appreciate the info.
Instructor’s Response:
1) See pages 63-65 (reference Logistic Regression Models, Joseph M. Hilbe, 2009). All GLM-based
models, and all parametric statistical models in general, are based on an underlying PDF
(probability distribution function). The likelihood is simply a re-parameterization of the PDF such
that given a set of fixed observations, the function is used to determine the most likely
parameters that produce the observations. In other words, the likelihood function solves for the
parameters that make the observations or response term (y) most likely. Recall that a PDF starts with given parameters and calculates the probability of the observations. The likelihood function works in the
opposite direction.
Like the PDF, the observations in the likelihood function are assumed to be independent. The likelihood is therefore the product of the PDF terms over all observations. However, numerically it is much easier to deal with sums of terms rather than products. The natural log of the likelihood function is called the log-
likelihood function, and is used for all maximum likelihood algorithms.
Bayesian analysis, however, uses the likelihood, multiplying it by a prior probability distribution
to obtain a posterior distribution. The mean value of the posterior distribution of an individual
predictor is the Bayesian coefficient.
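To make this concrete, here is a hedged R sketch (model and data names are placeholders) showing that the reported log-likelihood is just the sum of the log Bernoulli probabilities at the fitted values:
fit <- glm(y ~ x, family = binomial, data = mydata)
mu  <- fitted(fit)
sum(mydata$y * log(mu) + (1 - mydata$y) * log(1 - mu))   # matches logLik(fit)
logLik(fit)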
2) Continuous predictors are usually categorized or factored into groups depending on the
nature of the distribution of the predictor. There is no hard and fast rule, though. If a continuous
predictor has a smooth range of values, usually one will divide it into 3 to 5 equally sized groups.
Or, if the variable is not smooth and seems to almost have breaks in the range of values, it may
be wise to categorize based on the natural breaks in the data. There are a variety of different
categorizing strategies. Which you employ depends on the data and how you’re looking at it.
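A hedged R sketch of the "equally sized groups" strategy using quantiles (variable and data names are placeholders):
brks   <- quantile(mydata$age, probs = seq(0, 1, 0.25), na.rm = TRUE)
agegrp <- cut(mydata$age, breaks = brks, include.lowest = TRUE)
table(agegrp)   # four roughly equal-sized groups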
3) You can likely find many applications in engineering which can be modeled using logistic
regression. In fact, there are several books I know of that are on Generalized Linear Models
authored by engineering profs for their students. Doug Montgomery at Arizona State (where I
am also located) is probably the most famous of these. The author of SAS's Genmod procedure,
which estimates SAS's GLM, GEE, and all GLM-based Bayesian models, has his PhD in
I tend to drop factors when my consulting subject matter expert tells me that it isn't
theoretically justified or that the inclusion of other factors takes the place of the candidate for
exclusion. I like that reason far more than any kind of "automated" or "rule-based" reason.
To be clear, other people might offer a different response here. I am comfortable with my
response because I rarely deviate from the attitude that the model is just a tool that is going to
allow me to draw inference about a population and address the study hypotheses. There are
lots of tools in the toolbox, and frankly, they almost always lead me to the same answer anyway,
so justification of my choice is rarely required.
Instructor’s Response:
I hesitate to say that these are "the basics" because in some cases, this is actually more
complete than might be carried out. That said, this is a nice collection of steps that would stand
up to scrutiny of most analyses. I will say that I personally have never carried out the Hosmer-
Lemeshow test. Mostly, because I just don't have much confidence in it. If my client wanted it,
and there was a continuous covariate (or I could verify that there were a really large number of
covariate patterns), then I would run it. Otherwise, I would talk them out of it.
The other comment that I would make is that I tend to bounce between 1 and 4bii as the
covariates of interest are identified. Then, I go through the other items on the list. In building a
model, I often create a table of all possible covariates and run two sample tests for the outcome
to see whether there is an unadjusted association, and to investigate the extent and nature of
missing data over all of the covariates. The bivariate associations and the missing data
investigation might uncover certain estimability problems like quasi- or complete separation. This occurs when the outcome has no variability within at least one level of a covariate. Imagine that you have an outcome success/failure and that you have data on people for which you know their sex. Now, imagine that some of the males fail and some succeed. However, imagine that all of the females succeed. In such a case, usual logistic regression models will fail to estimate a parameter for female.
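A quick hedged sketch of the kind of cross-tab that reveals this (variable names are placeholders):
with(mydata, table(sex, outcome))   # a zero cell signals quasi- or complete separation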
So, in your create model step 1, add: 1a) calculate bivariate associations (checking for
estimability problems) and 1b) record missingness. Otherwise, it looks good!
INDEX

A
Adjusted Model SEs · 32
AIC · 21, 22, 41, 42
AIC statistic · 42
ANOVA · 21

B
Bayesian coefficient · 44
BIC statistics · 42
Binary Logistic Model · 45
Box-Tidwell test · 24

C
Canonical form · 4
Chi2 goodness fit using beta-binomial · 47
confidence interval · 23
Confidence Intervals
  quantiles · 4
Continuous Variables · 36
Cut Points · 34

D
Deciding Variables in Model · 47
Deviance statistic · 40
dispersion · 50
Dropping a Significant Factor · 49

E
empirical distribution · 26

F
Fracpoly · 25
fracpoly models · 26

G
g values in Hl Algorithm · 11
Grouping Data · 32

L
Likelihood ratio statistic · 24
Likelihood ratio test · 24
Linearity of Slopes Test · 46
Linearity Test pg 89 · 22, 23, 24, 25, 40, 42, 44, 46
link function · 7, 8
Logistic link function · 7
logistic model · 25, 42, 45
logistic regression · 24, 39, 42, 43, 44, 45
logit · 10, 24, 25, 38, 45, 46
Log-Likelihood · 23, 44
Log-Likelihood formation · 27

M
mean · 9, 10, 23, 37, 38, 40, 44, 46
Missing value · 41
Modeling Steps · 49

O
odds ratio · 10, 37, 38, 39
Odds Ratio · 11, 37, 38
Odds Ratio/Standard errors · 7
Over-Dispersion and Under-Dispersion · 30
Over-fitted models · 20

P
pearson dispersion · 46
Percentages in Logistic Analysis · 2
Pg75 Course text P-values · 22
point estimate · 23
posterior distribution · 44
Power and Unbalanced Questions · 1
Predict function · 8
prior probability distribution · 44
probit model · 45
profile likelihood · 22, 23
P-value · 45

Q
quantile regression · 26
quasibinomial · 43
Questions · 19

S

W