
Logistic Regression

Instructor(s):Dr. James Hardin and Dr. Joseph Hilbe

(Reference book for old course: Logistic Regression Models, Joseph M. Hilbe, 2009)
(Reference book for new course from Sept 2015: Practical Guide to Logistic Regression, Joseph
Hilbe, 2015)

Note: These discussions have been selected from prior courses and lightly edited to correct typos
and remove student names. A useful way to use this resource is by browsing, or via the index.

WEEK 1:

Subject: Power and Unbalanced Questions


Participant’s Question:
I have a few questions. I know I'm a little ahead of week 1 in the book; however, I am interested
in how one determines when a sample is too "small" or "unbalanced" (both are discussed on p. 96).
First, how do we determine power for logistic regression based on the sample size? I've seen
power tables for other parametric tests but cannot find anything specifically for logistic
regression, and therefore have difficulty determining whether my sample sizes are large enough. I am
about to start a study that will have n = 130 (group 1 = 65, group 2 = 65); however, I can't figure
out how much power I will have (and I don't know what size of effect to expect from the study,
as it's exploratory). Any guidance or suggestions would be appreciated.
Second, are there any general rules for what constitutes unbalanced? Obviously there can be
degrees of imbalance (e.g., group 1 = 65 and group 2 = 60 is less unbalanced than group 1 = 90
and group 2 = 60), but at what point should we be concerned? Are there any general rules about
when we should be concerned about balance (e.g., if groups differ by >= 10 you should be
concerned, or something like that)?
Instructor’s Response:
Short answer about power is that I rely on software to do that for me. My first power tip is to use
software: I personally use PASS from NCSS Software out of Kaysville, Utah. That said, Stata and
SAS have dedicated commands to help provide solutions to situations ranging from basic (two
sample t-test) to complicated (even allowing you to set up simulations).
My second tip: I always discuss power in terms of standardized differences instead of absolute
differences. A standardized difference is a difference in terms of standard deviations. Imagine
that we are comparing two means. If I tell you that the absolute difference of interest is 5 you
really have no idea whether that is a small or large difference. If the standard deviation is 0.1,
then a difference of 5 is enormous. On the other hand, if the standard deviation is 100, then
this is a really tiny difference. A great reference is "A Power Primer" by Jacob Cohen in
Psychological Bulletin, 1992, 112(1), 155-159. In that article, Cohen advises that a small
standardized difference (for which you need greater power which is usually called for in an R01
study) is 0.2 standard deviations; a medium standardized difference (for which you need good,
but not great, power as called for in most R21 studies) is 0.5 standard deviations; a large
standardized difference (for which you need only small power which might be called for in a
pilot or hypothesis-generating study) is 0.8 standard deviations. This is great, because you can
always multiply by the standard deviation of the particular measures that you are collecting to
see what the absolute detectable differences are.
My third tip: I almost always simplify my power question to a two-sample mean question. Even
if my question is that I want to compare treatment group versus control group in a logistic
regression, I am basically running a t-test on the coefficient. So, I can get the power answers for
a two-sample t-test.
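To make that concrete, here is a minimal R sketch using base R's power.t.test(); the group size of 65 comes from the question above, sd = 1 puts delta on the standardized scale, and the 0.2/0.5/0.8 values are Cohen's small/medium/large differences. This is only an illustration of the two-sample simplification, not output from PASS.

# Power for a two-sample comparison with 65 subjects per group, for
# standardized differences of 0.2, 0.5, and 0.8 SDs (Cohen's small/medium/large).
# sd = 1 means delta is already on the standardized scale.
sapply(c(small = 0.2, medium = 0.5, large = 0.8), function(d) {
  power.t.test(n = 65, delta = d, sd = 1, sig.level = 0.05,
               type = "two.sample")$power
})

# Or turn the question around: how many subjects per group are needed for 80%
# power to detect a medium (0.5 SD) standardized difference?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample")$n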
Most software will allow you to consider equal or unequal sample sizes, so that is not a
problem. It is also not a problem in analysis as long as things can be managed in data collection
to randomize over other covariates. Even if I am going to include those other covariates in the
analysis, I still try to randomize in such a way as to balance those covariates over my group
assignments of main interest (treatment and control groups, since [mostly] I am a health
studies biostatistician).
I don't have a rule of thumb that makes me nervous about differences in group sizes, especially
since I often analyze data for which there is 1-to-5 matching! That said, if I am nervous about it
or if the data are observational [with no randomization to groups], then I compare each
covariate for differences across the groups of interest. If there is evidence of non-random group
assignment, I can consider propensity matching if I want to make a pseudo-balanced treatment
study. Otherwise, I simply ensure that I include all of the other covariates in the analysis even if
those covariates are not significant just so that I get a cleaner estimate of the group effect of
interest.

Subject: Percentages in Logistic Analysis


Participant’s Question:
On pages 30 and 31 Hilbe interprets two different logistic analyses. Above each is the
R code and output, and in the interpretation he states percentages, i.e., 37% greater odds, etc.
I do not understand how he is deriving the specific % values in the interpretation statements
from the output given above each interpretation. Can you provide some insight?
Instructor’s Response:
Recall that logistic regression uses the logit link. On the pages you reference, the entire model is
driven by:

P(died = 1) = exp(b_0 + b_1*FactorType2 + b_2*FactorType3) / (1 + exp(b_0 + b_1*FactorType2 + b_2*FactorType3))
b_0 = intercept
b_1 = coefficient on the indicator variable of type2 procedure (urgent)
b_2 = coefficient on the indicator variable of type3 procedure (emergency)
This model basically hypothesizes that there are 3 underlying populations (defined by type of
procedure) for which patients die with (possibly different) probabilities.

Population 1: elective procedure patients, FactorType2=0, FactorType3=0


P(died=1 | Population 1) = exp(b_0)/(1+exp(b_0))
P(died=0 | Population 1) = 1-exp(b_0)/(1+exp(b_0)) = 1/(1+exp(b_0))
Odds(died=1|Population 1) = P(died=1|Population 1)/P(died=0|Population1) = exp(b_0)

Population 2: urgent procedure patients, FactorType2=1, FactorType3=0

P(died=1|Population 2) = exp(b_0 + b_1)/(1+exp(b_0 + b_1))
P(died=0|Population 2) = 1/(1+exp(b_0 + b_1))
Odds(died=1|Population 2) = exp(b_0 + b_1)

Population 3: emergency procedure patients, FactorType2=0, FactorType3=1

P(died=1|Population 3) = exp(b_0 + b_2)/(1+exp(b_0 + b_2))
P(died=0|Population 3) = 1/(1+exp(b_0 + b_2))
Odds(died=1|Population 3) = exp(b_0 + b_2)
Odds ratios (ratio of two different odds)
Odds ratio of Population 2 odds to Population 1 odds = exp(b_0 + b_1)/exp(b_0) = exp(b_1)
Odds ratio of Population 3 odds to Population 1 odds = exp(b_0 + b_2)/exp(b_0) = exp(b_2)
So, from the output we know that exp(b_1) = 1.3664596 is the estimated odds ratio of the odds
of death for FactorType2 (urgent) patients compared to the odds of death for FactorType1
(elective) patients. That is, the odds of death for urgent procedure patients is 1.36 times the
odds of death for elective patients. Another way of saying that is that urgent procedure
patients have (1.366 - 1.000)*100% = 36.6% higher odds of death compared to the odds of
death for elective procedure patients.
Instructor Continued:
Because we have groups in this model, we can easily relate everything to the data. Here is a
tabulation of the data:
            Population 1   Population 2   Population 3
died=1           364            104             45
died=0           770            161             51
Total           1134            265             96
Odds of death for population 1: (364/1134) / (770/1134) = 364/770 = 0.4727 = exp(b_0), the
exponentiated intercept from our model
Odds of death for population 2: (104/265) / (161/265) = 104/161 = 0.64596
Odds of death for population 3: (45/96) / (51/96) = 45/51 = 0.88235
0.64596 / 0.4727 = 1.366 = exp(b_1) from our model. The odds for population 2 is 1.366 times
the odds for population 1
0.88235 / 0.4727 = 1.866 = exp(b_2) from our model. The odds for population 3 is 1.866 times
the odds for population 1
Alternatively,
0.64596 = 1.366 * 0.4727 = (1.000 + 0.366)*0.4727 = 0.4727 + 0.366*0.4727, i.e., 36.6% higher odds
0.88235 = 1.866 * 0.4727 = (1.000 + 0.866)*0.4727 = 0.4727 + 0.866*0.4727, i.e., 86.6% higher odds
In either case, our model coefficients/odds ratios directly capture the information that is in the
data!
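If you want to check this in R, here is a small sketch that rebuilds the tabulation above as grouped (binomial) data and recovers the same odds and odds ratios; the level labels elective/urgent/emergency follow the discussion above.

# Counts from the tabulation above, as grouped binomial data
tab <- data.frame(type  = factor(c("elective", "urgent", "emergency"),
                                 levels = c("elective", "urgent", "emergency")),
                  died  = c(364, 104, 45),
                  alive = c(770, 161, 51))

odds <- tab$died / tab$alive   # 0.4727, 0.6460, 0.8824
odds[2] / odds[1]              # 1.366 = exp(b_1)
odds[3] / odds[1]              # 1.866 = exp(b_2)

# The same quantities straight from a grouped logistic regression
fit <- glm(cbind(died, alive) ~ type, family = binomial, data = tab)
exp(coef(fit))                 # exp(b_0), exp(b_1), exp(b_2)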

Subject: Two link functions


Participant’s Question:
Question 1
In section 1.3, I see that the link function is defined in two ways: ln(p·(1-p)) and ln(μ·(1-µ)). I
assume the former is for the case of a binomial distribution and the latter for the case of a
Bernoulli distribution. Is that correct?
Question 2
In section 1.3, the following statement is made with no context that I can find.
The mean of the distribution can be obtained as the derivative of the negative of the second
term with respect to the link. The second derivative yields the variance.

The truth of this statement was not obvious on first reading :-) When I tried to verify this
statement, I was led to the concept of a cumulant and proving the statement appears to follow
given the concept of a cumulant. Is there a more basic way to prove this statement?
Instructor’s Response:
Question 1: the two presentations are just talking about the same thing. The mean (in general)
is usually presented using the Greek letter mu, and the mean when discussing the Bernoulli or
binomial is usually presented using the letter p (which is usually the probability of success for a
single trial). In any case, the link function is given by log(p/(1-p)) = log(p) - log(1-p), which is
slightly different from what you quoted above.
Question 2: You are correct. I am answering this without benefit of looking at the book, but the
Bernoulli/binomial distributions are members of the exponential family of distributions for
which the term you reference is undoubtedly the cumulant function. That function has the very
properties that you mention. So, the result that is quoted is a consequence of the fact that the
Bernoulli/binomial distributions are members of the exponential family of distributions: if the
distribution is written in exponential family form, then the cumulant function can be identified,
because the exponential family carries that property. The explanation is a bit circular, but the
truth of the italicized sentence in your question is a consequence of the cumulant function,
which is identifiable in the distribution of interest because that distribution is a member of the
exponential family, whose members, when written in a specific form, make that function
identifiable.
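For what it is worth, here is the calculation written out for the Bernoulli case (a sketch in LaTeX notation, using only the exponential-family form quoted above and ordinary differentiation):

% Bernoulli log-density in exponential-family form; theta is the canonical (logit) parameter
\log f(y;p) = y\log\frac{p}{1-p} + \log(1-p) = y\theta - b(\theta),
  \qquad \theta = \log\frac{p}{1-p}, \qquad b(\theta) = \log\bigl(1 + e^{\theta}\bigr) = -\log(1-p).

% The first two derivatives of the cumulant b(theta) give the mean and the variance
b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = p,
  \qquad b''(\theta) = \frac{e^{\theta}}{\bigl(1 + e^{\theta}\bigr)^{2}} = p(1-p).

Note that b(theta) = -log(1-p) is exactly "the negative of the second term" in the sentence you quoted, so the book's statement follows from these two derivatives.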

Subject: Confidence Intervals


Participant’s Question: When calculating the confidence intervals on the coefficients of the
model predictors, the book references the quantiles as derived from a standard normal
distribution, i.e., 1.96 for 95% confidence intervals. Why wouldn't the quantiles be based on a t-
distribution, where the degrees of freedom are based on the number of data points and the number
of model predictors?
Instructor’s Response: They are for a normal or linear (OLS) regression, but not for logistic
regression or other members of the generalized linear models family. The coefficients are
assumed to follow a normal distribution, whereas the coefficients of a normal linear model are
assumed to follow a t distribution. Maximum likelihood theory, upon which GLMs and other
basic regression models are mostly based, assumes the normality of coefficients. For instance, a
Poisson, or negative binomial, or let's say even a finite mixture model -- these coefficients are
assumed to be normally distributed. Now, this is in fact not always the case, and it's possible to
test them. But if the model is well fit, the coefficients are in fact very close to normal.

Subject: Canonical form


Participant’s Question:
Can you elaborate in layman's terms what is meant by a function being in its "canonical
form"? How would one know if a function is in "canonical form"? What defines that? What
other "forms" are there? Are they just called "non-canonical"? Or are there other "forms" of
functions that are important in this context?
Instructor’s Response:
Canonical form simply means its natural form - the same as the PDF. The PDF for the Bernoulli
distribution, in exponential form, is y*log(p/(1-p)) + log(1-p). It’s in exponential form because it
fits to the master form for the exponential family PDF: exp{y*theta - b(theta) + c(y;psi)}. A scale
parameter is also used in this formula, but for the binomial, Poisson, geometric, and negative
binomial distributions the scale is defined as 1, so it is not needed. Remember, the Bernoulli
PDF is a subset of the binomial.
You can see that the Bernoulli (or binomial) PDF is in the same form as the exponential family
form. The first term for the exponential family is y*theta, where y is the response term and
theta is the link function.
For the Bernoulli PDF it is y*log(p/(1-p)), so theta = log(p/(1-p)).
And that's correct: p/(1-p) is the definition of the odds, and to log it is to have the log-odds, or logit. The
link defines the model. The second term is b(theta), which is called the cumulant. The unique
feature of the cumulant is that its first derivative with respect to theta is the mean. The second
derivative is the variance function. The Bernoulli cumulant is b(theta) = -log(1-p) = log(1+exp(theta));
its first derivative is the mean, p, and the second is the variance, p(1-p).
The canonical form of the Bernoulli model is the one with the logit link. If I
replace the link function log(p/(1-p)) with the inverse of the cumulative normal distribution
function, the model becomes a probit model. If we replace it with log(-log(1-p)) it is called a
complementary log-log model. Both the probit and cloglog models are non-canonical since they
do not have the natural form of the distribution.

Subject: Risk ratio and Odds ratio


Participant’s Question:
I'm reading the revision of Chapter 2.1 (reference Logistic Regression Models, Joseph M. Hilbe,
2009), and I'm realizing that I'm getting confused by the language of Risk and Odds.
First, in this sentence:
"As a prelude, the risk of y given x=1 is D/(B+D), and of y given x=0 is C/(A+C)."
I'm assuming that we are talking about the "chance" that y=1 given that x=1 in the first part, and
given that x=0 in the second part.
Since the example is about death being the response that =1, it makes sense to talk about risk,
but in the cases I imagine in my own work, I want to assess what can help students pass a course
or graduate, so my y would normally be 1 if the student passes and 0 if he or she does not pass.
So am I correct in calling the "risk factor" the "chance" when I'm looking at success?
I understand odds to be simply the ratio of the two outcomes in the given data, within the
confidence intervals. Whether 1 is success or failure is, I guess, arbitrary mathematically, and is
rather based on the focus of the data.
Instructor’s Response:
I wrote a rather long explanation, thought I had posted it, and see that it did not post. I'm not going
to repeat that length of discussion again, but let me show this.

        X0    X1
      -----------
Y1       A     B
Y0       C     D
      -----------
       A+C   B+D

Odds refers to p/(1-p), where p is a probability. An odds ratio is the ratio of two such odds,
with the numerator with respect to level 1 and the denominator with respect to level 0, where 1 and 0 are
the two levels being compared.
With respect to the table above, an OR can be characterized as [B/D]/[A/C]. Using a value of 2 as
an example: the odds of y=1 is 2 times greater for X1 than it is for X0.
For a risk or rate ratio, [B/(B+D)]/[A/(A+C)]: the likelihood or probability of y=1 is 2 times greater
for X1 than it is for X0.
I suggest using "odds of y is" for logistic regression models and "likelihood of" for Poisson and
negative binomial regression. When the numerator (A and B) of the relationship is very small
with respect to the denominator, the odds and risk ratios will be close to the same. But it's safer
to just use odds language for logistic models.
Confidence intervals don’t have anything to do with it.
Participant continued:
So risk or rate ratio is the same as odds ratio? I was thinking it might be risk of error, but you've
cleared that up since confidence intervals have nothing to do with it.
Odds of y is still to me a ratio of the probability of y=1 over the probability of y=0. Is that
correct? Then the odds ratio is the odds of one x value having a y=1 over the odds of a
reference x value having a y=1.
Instructor continued:
Risk ratio is quite different from an odds ratio, as I showed in my previous communication. The
denominators of each are different.
Yes, you have it right about an odds and odds ratio. The ratio aspect is that the numerator of
the ratio is of the odds of interest compared to the denominator (odds of the reference level).
For a binary (0,1) predictor, the reference level is 0; for categorical variables, the reference is
the level excluded from the regression, and for continuous variables, the reference is the lower
of two sequential values; e.g., the reference is age 40 when you are determining the odds ratio
for age 41.
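Here is a small R sketch of the two calculations, using made-up counts for A, B, C, and D laid out as in the table above; it also shows that the two ratios generally differ unless the outcome is rare.

# Hypothetical cell counts (illustration only), laid out as in the 2x2 table above
A <- 40;  B <- 20    # y = 1 counts for X0 and X1
C <- 160; D <- 20    # y = 0 counts for X0 and X1

odds_x1 <- B / D               # odds of y = 1 at X1
odds_x0 <- A / C               # odds of y = 1 at X0
OR <- odds_x1 / odds_x0        # odds ratio

risk_x1 <- B / (B + D)         # probability of y = 1 at X1
risk_x0 <- A / (A + C)         # probability of y = 1 at X0
RR <- risk_x1 / risk_x0        # risk (rate) ratio

c(OR = OR, RR = RR)            # 4.0 vs 2.5 here: they agree only when the outcome is rare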

Subject: Probability distribution function


Participant’s Question:
The concept of a 'probability distribution function', and the special cases of the binomial and
Bernoulli PDFs. Could you explain, perhaps with some examples?
Instructor’s Response:
I discuss the nature of a probability distribution and how to convert it to a likelihood and then
log-likelihood function in chapter 4 (reference Logistic Regression Models, Joseph M. Hilbe, 2009;
see pages 63-68 in particular). Remember, and this is very important, parametric models such as
logistic regression and grouped logistic regression models are based on a probability function.
The key to determining which model to use for a given data situation is based on the
distribution of the model response (dependent variable for the model). When the response -- or
variable to be modeled or understood -- is binary, 1/0, then we understand that it follows a
Bernoulli (1/0) distribution, which is a subset of the binomial distribution. It is a binomial
distribution with n=1. What this means is discussed at length in the book.
The Bernoulli distribution is expressed as Product { p^y * (1-p)^(1-y) }.
This is a PDF, or probability distribution function.
If you were modeling count data, such as LOS, then you would base your model on a PDF that is
a distribution of counts. The Poisson and negative binomial PDFs are used for this purpose.
I hope that this helps. Probability distributions are fundamental to any type of regression
modeling. Standard linear regression is based on the normal or Gaussian PDF.
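As a tiny numerical illustration (the y values and p below are made up), the Bernoulli PDF and the log-likelihood it leads to can be evaluated in R with dbinom() and size = 1:

# Bernoulli PDF p^y * (1-p)^(1-y) evaluated for each observation, and the
# log-likelihood of the sample at a chosen p (both y and p are made up here)
p <- 0.3
y <- c(1, 0, 0, 1, 0)

dbinom(y, size = 1, prob = p)                    # same as p^y * (1-p)^(1-y)
sum(dbinom(y, size = 1, prob = p, log = TRUE))   # log-likelihood at p = 0.3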

Subject: Logistic link function
Participant’s Question:
What is the difference between the Binary Logistic Link Function and the Logistic Model?
Instructor’s Response:
There are two basic parameterizations of logistic regression. One is where the model has a
binary response (1/0). That is binary logistic regression. Then there is grouped or binomial
logistic regression, where the response is a proportion; i.e. the number of successes out of the
number of observations having the same covariate pattern. A covariate pattern is an
observation having a specific set of values for the predictors.

Subject: Odds Ratio/Standard errors


Participant’s Question:
Could you explain how to calculate - by hand - the standard error of the odds ratio?
Instructor’s Response:
The SE of the odds ratio is calculated using the delta method. In this case, however, the method
reduces to an easy calculation: just multiply the odds ratio of the predictor by the SE of its
coefficient.

Subject: link function


Participant’s Question:
I wanted to get a better sense of how the term "link function" is defined. In one sense, the term
seems to be used as, for example with the logit (or log-odds) form, as the transformation that
takes the binary probability function from a non-linear form to a linear form. In another sense,
it appears to represent the defining function type (for a "family of functions" ) that falls within
the scope of the generalized linear models. Would you shed some light on how best to
understand the term "link function"?
Instructor’s response:
It is the function that connects the linear predictor to the fitted value, moving between the linear
and nonlinear forms. It is also what defines the model. For instance, logistic regression, or logit
regression, is a binomial model with a logit link. Probit regression is a binomial model with a
probit link, and complementary loglog model is a binomial model with a complementary loglog
link.
Of course, this does not hold in the same way for other models; eg Poisson regression which has
a log link. The traditional negative binomial also has a log link, but the form of the negative
binomial that is derived directly from the negative binomial PDF has no named link. In my book,
Negative Binomial Regression (2007; 2nd edition, 2011), Cambridge University Press, I labeled
such a model NB-C, for canonical NB. The link that derives directly from the PDF is called a
canonical link. The logit link is canonical for the binomial PDF, the log link is canonical for the
Poisson. The inverse link is canonical for a gamma model and the identity link is canonical for
the Gaussian or normal model. That is, xb=mu, or the linear predictor is the same as the
predicted value.
Let me give a simple example. Suppose that we have a logistic model with an intercept and one
binary (1/0) predictor. Let's say that the intercept is .5 and the coefficient of the predictor is .5.
We have the following linear predictors:

xb = intercept + coefficient * predictor

1 = 0.5 + .5 * 1, when predictor=1
0.5 = 0.5 + .5 * 0, when predictor=0

The fitted value is calculated using the logit inverse link function, 1/(1+exp(-xb)).
We have
xb = 1 : 1/(1+exp(-1)) = .73105858
xb = .5: 1/(1+exp(-.5)) = .62245933
Both fitted values are probabilities; note that their relationship with the predictors is nonlinear,
even though the linear predictor xb is linear in the predictors.
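The same arithmetic in R, using plogis() as the inverse logit (a quick check, nothing more):

# Worked example above: intercept 0.5, coefficient 0.5, binary predictor
xb <- c("predictor=1" = 0.5 + 0.5 * 1,
        "predictor=0" = 0.5 + 0.5 * 0)

plogis(xb)           # inverse logit, 1/(1+exp(-xb)): 0.7310586 and 0.6224593
1 / (1 + exp(-xb))   # identical, written out by hand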

Subject: Intercept term in Logit linear model


Participant’s Question:
I know that the text indicates that the constant term is generally meaningless, as opposed to the
coefficients of the predictors. In the case of a logit model with only one predictor that is binary
(1/0), however, would it be reasonable to interpret the exponentiated constant term (intercept)
as the odds (not odds ratio) of the outcome when the predictor is zero. This is with the
understanding that exponentiating the coefficient of the predictor would yield an odds ratio
(not odds) of the predictor value "1" to the predictor value "0". Or, am I getting this mixed up?
Instructor’s response:
All texts that have anything to say about this state that the exponentiated intercept has no
interpretation, and I went along with the same thing. However, it does have an interpretation.
For a single categorical predictor (binary or more), it is simply the ratio, at the reference level,
of the count with response=1 to the count with response=0 -- that is, the odds of the outcome at
the reference level. Not very interesting.

Subject: Predict function


Participant’s Question:
I am mostly using R (with RStudio) and Stata a little, but I must progress.
I am not sure I fully understand the use of the predict function. I hoped to be able to obtain a
prediction from the current model under various conditions and thus typed:

predict fit if white=0

but the computer says "=exp not allowed", so I try

predict fit1 if white==0
predict fit2 if los==10

But I am not sure how he calculates these (reference Logistic Regression Models, Joseph M. Hilbe,
2009), for with sum fit2 the output indicates

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
        fit2 |         89    .3371759    .0180626   .2893484   .3439209

Is this correct? Why does it bring me the number of obs?
Instructor’s response:
The single "equals" sign is the sign of assignment. It calculates what is on the right hand side of
"=" and assigns the variable on the left that value. "==" is the sign of identity. It (especially in R)
is a logical relationship which, if true, returns a value of 1, and if false returns 0. The same is the
case when it is used to select observations. If you want to calculate predicted (fitted) values only for
observations where los has a value of 10, then you must use "==". You are not assigning a value,
but rather conditioning on values. If you use the command
predict mu if white==0, only observations with 0 for white will get predicted values. Missing
values will be given to all observations with white==1.
Let's look at your example. There are 1,495 observations in the model. If we create a fitted
value, call it mlos, for only those observations having an LOS of 10, only those observations with
LOS==10 will have values for mlos. Here is the relevant code.

. predict mlos if los==10
(option mu assumed; predicted mean alive)
(1406 missing values generated)

. tab mlos

  Predicted |
 mean alive |      Freq.     Percent        Cum.
------------+-----------------------------------
   .6513158 |         78       87.64       87.64
   .7165354 |         11       12.36      100.00
------------+-----------------------------------
      Total |         89      100.00

. tab mlos, missing

  Predicted |
 mean alive |      Freq.     Percent        Cum.
------------+-----------------------------------
   .6513158 |         78        5.22        5.22
   .7165354 |         11        0.74        5.95
          . |      1,406       94.05      100.00
------------+-----------------------------------
      Total |      1,495      100.00

. tab white mlos

           |  Predicted mean alive
   1=White |  .6513158   .7165354 |     Total
-----------+----------------------+----------
         0 |         0         11 |        11
         1 |        78          0 |        78
-----------+----------------------+----------
     Total |        78         11 |        89

. sum mlos

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
        mlos |         89    .6593767    .0215867   .6513158   .7165354

We see that limiting predicted values (mlos) to only those with LOS==10 gives us 89 values of
mlos. The other 1,406 are missing; I can see this by typing "missing" as an option for tab. mlos
has only two values, one for when white==1 (.65) and another for when white==0 (.72).
When you summarize (sum) mlos, only nonmissing values are used in the calculation, of course.
This shows up in the sum table as 89 observations, which is correct. That is, there are 89 values
of mlos, the mean of which is .6594.
I hope that this clarifies what is happening.

Subject: Standard Errors of Odds Ratios


Participant’s Question:
In performing inference on odds ratios, we carry over the z and p-values computed for the
coefficient estimates. Stata also provides standard errors for the odds ratios. I am trying to think
through what is the interpretation of the standard error, since the confidence intervals (derived
from exponentiating the coefficient confidence intervals) aren't symmetric. Is there any
inference we can do on the odds ratios or is better to stick to working with things derived from
the coefficients, such as confidence intervals?
Instructor’s response:
(reference Logistic Regression Models, Joseph M. Hilbe, 2009)
We use the coefficients to make predictions, not the odds ratios. Remember also that the SE of
an odds ratio does not come directly from the V-C matrix, but is calculated using the delta
method. The SE of an odds ratio, let's say for white, is calculated by multiplying the model SE by
the odds ratio of white. I can use Stata's special terms for coefficients, _coef[variable], and SEs,
_se[variable], to calculate the SE of an odds ratio.
. glm died white, fam(bin) nolog nohead

------------------------------------------------------------------------------
             |                 OIM
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       white |   .3025126   .2049038     1.48   0.140    -.0990914    .7041167
       _cons |  -.9273406   .1968928    -4.71   0.000    -1.313243   -.5414377
------------------------------------------------------------------------------

. di _se[white]
.20490378

. di _se[white] * exp(_coef[white])
.27728702

By exponentiating the coefficient of white, we get the odds ratio. Now let's look at the
output for a table of odds ratios. Use of eform with the glm command provides exponentiated
coefficients, which for logit models are the odds ratios. The nohead option tells Stata not to
show the heading statistics and nolog suppresses the iteration log.

. glm died white, fam(bin) nolog eform nohead


------------------------------------------------------------------------------
             |                 OIM
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       white |   1.353255    .277287     1.48   0.140     .9056599     2.02206
       _cons |   .3956044   .0778917    -4.71   0.000     .2689463     .581911
------------------------------------------------------------------------------

Yep, .277287 is what we find. We do not get this value directly from the V-C matrix, but the first
component of the product generating the statistic does. Confidence intervals for odds ratios
are, however, calculated by exponentiating the respective lower and upper confidence limits. Thus, for white,

. di exp(-.0990914)
.90565993
. di exp(.7041167)
2.0220598

Model coefficients and CIs are the base statistics, and are used for other related calculated
statistics.
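The same delta-method calculation can be reproduced in R. The sketch below uses simulated data in place of the data file used above, so the numbers are illustrative only; the recipe is the point.

# Delta-method SE of an odds ratio: OR standard error = OR * SE(coefficient)
set.seed(2)
white <- rbinom(200, 1, 0.5)
died  <- rbinom(200, 1, plogis(-0.9 + 0.3 * white))   # simulated outcome

fit <- glm(died ~ white, family = binomial)
b   <- coef(fit)["white"]
se  <- sqrt(diag(vcov(fit)))["white"]    # model SE of the coefficient

exp(b)                                   # odds ratio
exp(b) * se                              # delta-method SE of the odds ratio
exp(confint.default(fit)["white", ])     # CI for the OR: exponentiate the Wald CI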

WEEK 2:

Subject: g values in HL Algorithm


Participant’s Post:
When I try different g values in the HL algorithm, some values give me the error shown in the
attached image. The command works with some values of g and not others. What does this error
mean? What effect does changing g have on the solution?
Instructor’s Response:
The Hosmer-Lemeshow test is a fairly straightforward test. Basically, it divides up the data into
10 ordered (equal size) groups and then the expected number of positive outcomes and
observed number of outcomes are compared across the groups in a chi-squared test. The
problem is that sometimes there are ties in the data so that depending on how the data are
sorted, the statistic's value can vary. This was a huge detraction from the original
statistic. Some software packages didn't warn the user, and would produce a "random"
statistic. Depending on how many ties there were in the data, the randomness could be really
substantial. Anyway, other packages dropped calculation of the statistic since there wasn't an
agreed upon manner to handle the case. What R is doing is basically saying that they won't
calculate the statistic unless it is deterministic. So, it doesn't so much matter what argument of
g you give as just how many ties there are (along with the argument of g).
Instructor Continued:
I re-read my answer and I wasn't completely satisfied that my answer was clear. Let me try and
explain a very simple example and discuss the non-specificity of the statistic.
Imagine a simple dataset of 100 observations with one binary covariate x; let x=0 for the first 50
observations and then x=1 for the last 50 observations. Let y=1 for observations 1-10 and
observations 51-75; otherwise it is 0. In R, you could set this up like this:
y <- c(rep(1,10),rep(0,40), rep(1, 25), rep(0,25))
x <- c(rep(0,50),rep(1,50))

As you can see, we have set this up so that a logistic regression (b_0 + b_1x) will yield b_0 =
log(0.25) and b_1=log(4). How do I know that? Remember that b_0 is the log odds of the
outcome for x=0 where the odds are P(Y=1|x=0)/P(Y=0|x=0) = (10/50) / (40/50) = 10/40 = 0.25.
The b_1 coefficient is the log of the ratio of the odds of the outcome for x=1 to the odds of the
outcome for x=0. This ratio of odds is given by

[ (25/50) / (25/50) ] / [ (10/50) / (40/50) ] = 1 / 0.25 = 4.0

So, b_0 will be log(0.25)=-1.386 and b_1 will be log(4)=1.386. You can try it out

glm(y~1+x, family=binomial)

OK, now let's get the predicted probabilities. The first 50 observations all have x=0, and so the
predicted probability will be

exp(-1.386) / [1 + exp(-1.386)] = 0.20

The second 50 observations all have x=1, and so the predicted probability will be

exp(-1.386 + 1.386) / [1 + exp(-1.386+1.386)] = 0.50

Calculation of the Hosmer-Lemeshow statistic assuming 10 groups of 10: since the predicted
probabilities are already sorted, let's just define our 10 groups as the ordered groups of 10
observations. The expected number of positive outcomes in each of the first 5 groups is 0.20*10=2 and
the expected number of positive outcomes in each of the second 5 groups is 0.50*10=5. The observed
number of positive outcomes (given the way we have our data initially) is: 10, 0, 0, 0, 0, 10, 10,
5, 0, 0. Thus, our table of expected (top row) and observed (bottom row) counts looks like:
 2  2  2  2  2   5   5  5  5  5
10  0  0  0  0  10  10  5  0  0
which yields a chi-square statistic of 26.67 with a p-value of 0.002 (significant).
But, what if our outcome data (y) were sorted like this:
y <- c(rep(c(rep(1,2),rep(0,8)),5), rep(c(rep(1,5),rep(0,5)),5))
In that case, our table of expected and observed outcomes would look like:
2 2 2 2 2 5 5 5 5 5
2 2 2 2 2 5 5 5 5 5
which yields a chi-square statistic of 0.00 with a p-value of 1.000 (not significant).
Now, why is this a dumb example? Well, first of all there are only 2 covariate patterns, so why
would we even think about forming more than 2 groups? Second of all, it shows that forming
the groups is not well-defined when there are "ties". That is, the first 5 groups all have the same
predicted probability, so the calculation of the statistic is now going to be driven by where we
put the y=1 outcomes among those 5 groups. If the construction of the groups is non-
deterministic, then the resulting chi-square value of the statistic is also going to be random (or
at least driven by the sort order of the data).
Just because the example is dumb doesn't mean that the statistic gets to be dumb. Use this
goodness of fit statistic cautiously and always look at the collection of covariate patterns and
how those are distributed among the groups.
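To see the grouping issue concretely, the following sketch rebuilds the observed/expected table by hand for the example data, using ten groups of ten in the data's current sort order:

# Example data from above
y <- c(rep(1, 10), rep(0, 40), rep(1, 25), rep(0, 25))
x <- c(rep(0, 50), rep(1, 50))
fit <- glm(y ~ x, family = binomial)

grp <- rep(1:10, each = 10)                 # ten groups of ten, in sort order
observed <- tapply(y, grp, sum)             # 10 0 0 0 0 10 10 5 0 0
expected <- tapply(fitted(fit), grp, sum)   # 2 2 2 2 2 5 5 5 5 5
rbind(expected, observed)

# Because the fitted probabilities are tied within each level of x, re-sorting y
# changes the 'observed' row (but not the fit), which is exactly the
# non-deterministic behaviour described above.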

Subject: ROCtest Function in R
Participant’s Post:
The ROCtest function described on page 84 has a second argument (fold). It is set to 10.
What does this represent? I don't see it explained in the book and the documentation to the
LOGIT package just says k-fold.
It must mean something to people familiar with LOGIT analysis but perhaps not to someone
totally new to it.
Instructor’s Response:
Basically, what happens in the function is that the data are divided into k random pieces called
folds. Then, one at a time, each piece is set to the side as the test data and the remaining pieces
the training data. The training data yield a logistic regression which is then evaluated using the
test data. By default, the program works with 10 sets. In dividing up the problem into subsets,
the repeated calculation of the optimal cutpoint, sensitivity, and specificity allows estimation of
standard errors. Without this approach, you would get an ROC curve and a single statistic. With
the approach, you get the estimate and some idea of the standard error of the statistic, and so
can construct confidence intervals of the statistic or perform tests.
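For intuition about what the fold argument does, here is a generic base-R sketch (not the LOGIT package's own code) that refits the model on k-1 folds and computes the AUC on each held-out fold; the data are simulated for illustration.

set.seed(6)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
dat <- data.frame(y, x)

k    <- 10
fold <- sample(rep(1:k, length.out = n))    # random fold assignment

auc_one <- function(p, y) {                 # AUC via the Mann-Whitney identity
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

aucs <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit <- glm(y ~ x, family = binomial, data = train)   # fit on the k-1 training folds
  auc_one(predict(fit, newdata = test, type = "response"), test$y)
})

c(mean = mean(aucs), se = sd(aucs) / sqrt(k))   # estimate plus a rough standard error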

Subject: Scaled and Robust Standard Errors


Participant’s Post:
Could you please explain the difference between scaled standard errors and robust standard
errors?
Instructor’s Response:
Both approaches attempt to address a suspected shortcoming of the model. Generally, the
conditional means are well-estimated no matter what model you choose. After all, it's pretty
easy to estimate means. The conditional variances, on the other hand, are a consequence of the
assumptions of the model. If those assumptions are not met by the data at hand, and if the
discrepancy in the conditional variance of the data is also true for the population about which
we are making inference, then we must make a correction so that our inference is correct (i.e.,
so that it has the statistical properties indicated by our statistics; see the final paragraph).
Scaled standard errors are subjective. Using them is like saying, "I know that the model I am
using doesn't quite capture the conditional variance of the data, so I am going to change the
model standard errors by multiplying all of them by the same scalar." What you don't know is
whether this solution is correct. How do you know you multiplied by the correct scalar? How
do you know the correction shouldn't have been a polynomial? How do you know the
correction shouldn't have involved the mean, since, after all, the model is heteroskedastic by
definition? If the model was correct and we really didn't need to make any correction via scaled
standard errors, the inference is not correct (estimates are still consistent) using scaled standard
errors because they are not efficient.

Robust standard errors are objective. In case you didn't know, there are different kinds of
robust standard errors. Each one addresses one or more assumptions of the model, and then
specifically protects your inference for violation of those specific assumptions. Furthermore,
they protect inference from any form of the violation. So long as the conditional means are
meaningful, the inference will be correct. If the model was correct and we really didn't need to
make any correction via robust standard errors, the inference is still correct (estimates are still
consistent) using robust standard errors. The coefficient estimates are just not quite as
efficient. Using robust standard errors is like saying, "I know that the log-likelihood is not
correct, but that is ok because I have used a variance calculation to ensure that my inference is
correct."

Note how in both cases, we basically said, "Hey, world! My model is wrong." If all we do
afterward is discuss the properties of the regression coefficients that is ok. However, we really
shouldn't carry on in an analysis where we utilize likelihood ratio tests, or calculate residuals
based on likelihood arguments. After all, we already admitted the likelihood is wrong, so we
really shouldn't try to have it both ways. Stata, for example, won't calculate likelihood-based
statistics after you estimate a model using the sandwich (robust) variance estimate. It will,
however, calculate them after using scaled standard errors which isn't really consistent
behavior. Most other packages will calculate whatever you ask for, so this advice from me is a
bit heavy-handed. That said, I do think it wise to consider that using these alternate variance
estimates is a declaration of the model being wrong, and so your toolbox should be limited
afterward. If you really think your model is wrong, then perhaps you should spend a bit of time
trying to find a model that isn't wrong.

Consistent estimates are those that converge to the true parameter value as the sample size
goes to infinity. For example, if I have a statistic J and it is estimating a parameter value P, and
the expected value of J is (P + 1/n) where n=sample size, then you can see as the sample size
goes to infinity, the expected value of J is equal to P. Thus, J consistently estimates P.

Efficiency refers to the relative variance of certain estimators. If two estimates are both
consistent, but one has smaller variance than the other as the sample size increases, then the
estimate with the smaller variance is more efficient. For example, for a sample from a normal
distribution, the sample mean has variance 1/n and the sample median has variance pi/(2*n)
which is approximately equal to 1.57/n. The sample mean is more efficient than the sample
median when it comes to estimating the population mean.
When I say that inference is correct, I mean that when I construct a 95% CI, then the
construction of that interval really would contain the true parameter 95% of the time if we were
to carry out the same experiment repeatedly. I mean that a test of size alpha=0.05 really does
have Type I error equal to 5%.
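Here is a short R sketch of the two corrections side by side; it assumes the sandwich and lmtest packages are installed and uses simulated data, so treat it as an illustration rather than a recipe tied to any particular dataset.

library(sandwich)   # sandwich (robust) variance estimators
library(lmtest)     # coeftest() for re-testing coefficients with a chosen vcov

set.seed(3)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.2 + 0.8 * x))

fit <- glm(y ~ x, family = binomial)

coeftest(fit)                                    # model-based standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # robust (sandwich) standard errors

# "Scaled" standard errors: every model SE is multiplied by a single
# dispersion-based scalar, which is what the quasibinomial family does
qfit <- glm(y ~ x, family = quasibinomial)
summary(qfit)$coefficients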

Subject: R ‘factor’ value in Logistic Regression


Participant’s Post:
I have done some logistic regression with R for a few months now. I am curious about the role of
the R 'factor' value type in logistic (and linear) regression.
I have a couple of questions:
Does the glm() function automatically 'factorize' all non-numeric, categorical predictors, making
an intelligent guess as to which variables to convert to a factor?
In R, there are ordered and unordered factors. Does ordering a factor mean anything to
binomial logistic regression? I know it does make a difference to the results, because I tried it
(results not shown here). However, I am not sure how to interpret the differences.
For question #3 in this week's homework, I fitted the model twice: (1) with 'relig' covariate as an
ordered factor, and (2) with 'relig' just being factored but not ordered. The order ranges from
"anti-religious” to "very religious”.

# first: 'of' means ordered factor
affair_data$relig_of <- factor(affair_data$relig,
levels = 1:5,
labels=c("Anti", "Not", "Slightly", "Somewhat", "Very"),
ordered=TRUE)
# second: 'uf' means unordered factor
affair_data$relig_uf <- factor(affair_data$relig,
levels = 1:5,
labels=c("Anti", "Not", "Slightly", "Somewhat", "Very"),
ordered=FALSE)

After I fit the model to these two different value types, the coefficients are different. I do not
include the results here.
I know I can complete Question #3 without ordering 'relig.' However, I am curious what ordering
means. I am guessing that for multinomial regression ordering the dependent variable would
mean something.
Instructor’s Response:
Given that I don't often use R to do my data analysis, I was not aware of the issues that I am
about to address here.
Software packages try and help when they can, but it is our responsibility to know what is
happening. This is especially true when the software makes decisions on our behalf. When I
lecture to a live audience about this, I repeatedly emphasize that you don't want the software to
do your thinking for you. Because of that, my personal choice when using software is to NEVER
let it make decisions for me. I never tell SAS that my variable is a CLASS variable. I never tell
SPSS or R that my variable is a factor variable (ordered or unordered). I sometimes tell Stata
that I want indicators for my variable by specifying the "i." prefix on the variable name, but I use
Stata all of the time, the specification is explicit, and the specification shows up in the
output. OK - rant over.
Here is what happens in R with your variable. Let's say you have a variable F that has 5
levels. What you really want is to include an indicator for 4 out of the 5 levels (leave one level
out as the reference level). I explicitly create indicators for the 4 desired levels like this:
F1 <- 1*(F==1)
F2 <- 1*(F==2)
F3 <- 1*(F==3)
F4 <- 1*(F==4)
F5 <- 1*(F==5)

Then, I explicitly leave out the one that I explicitly want to be my reference level. For example,
like this

glm(y ~ x + z + F2 + F3 + F4 + F5, family=binomial)

If I tell R that my variable F is an "unordered factor", then when I use F in a specification it will
apply

contr.treatment(5)
2 3 4 5
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 1 0
5 0 0 0 1

which creates the exact same indicators as I explicitly created to get the same model.

glm(y ~ x + z + F, family=binomial)

R chooses the smallest numeric level as the reference which may or may not be the one I want,
but at least I understand what is happening.

Now, if I tell R that my variable F is an "ordered factor", then it applies contr.poly(5)

> contr.poly(5)

.L .Q .C ^4
[1,] -0.6324555 0.5345225 -3.162278e-01 0.1195229
[2,] -0.3162278 -0.2672612 6.324555e-01 -0.4780914
[3,] 0.0000000 -0.5345225 -4.095972e-16 0.7171372
[4,] 0.3162278 -0.2672612 -6.324555e-01 -0.4780914
[5,] 0.6324555 0.5345225 3.162278e-01 0.1195229

As you can see, this is not the simple generation of indicator variables. Rather, it is treating
these levels as ordered and making various contrasts. If I really wanted to do this, I would
explicitly create my own variables. Those are the terms that are automatically generated on my
behalf when F is an "ordered factor" and I estimate a model

glm(y ~ x + z + F, family=binomial)

If I want to explicitly generate the variables for this model, then this is how you do it:

PF1 <- -0.6324555*(F==1) - 0.3162278*(F==2) + 0.0000000*(F==3) +
        0.3162278*(F==4) + 0.6324555*(F==5)
PF2 <-  0.5345225*(F==1) - 0.2672612*(F==2) - 0.5345225*(F==3) -
        0.2672612*(F==4) + 0.5345225*(F==5)
PF3 <- -0.3162278*(F==1) + 0.6324555*(F==2) - 0.0000000*(F==3) -
        0.6324555*(F==4) + 0.3162278*(F==5)
PF4 <-  0.1195229*(F==1) - 0.4780914*(F==2) + 0.7171372*(F==3) -
        0.4780914*(F==4) + 0.1195229*(F==5)

and then I could use them as

glm(y ~ x + z + PF1 + PF2 + PF3 + PF4, family=binomial)

This gets the same output, but it does so under my explicit guidance. I know exactly where
those factors came from. I (might) even know what they mean. The point is, had I not known
that R was doing the above, I might have thought it was creating indicator variables. Even if I
knew it wasn't doing that, if I hadn't looked through the help files on the default treatment of
ordered and unordered factors, I might not have known that the treatments differ.
OK - now, finally, I am going to answer your questions; sorry for the long aside leading us to
here.

1. Does the glm() function automatically 'factorize' all non-numeric, categorical predictors,
making an intelligent guess as to which variables to convert to a factor?

It does not. That is, if you say nothing about the nature of a variable, then R will treat it as a
numeric variable. If you tell R to treat a variable as a "ordered factor" or an "unordered factor",
then you get the above treatment which you may or may not want.

2. In R, there are ordered and unordered factors. Does ordering a factor mean anything to
binomial logistic regression? I know it does make a difference to the results, because I tried it
(results not shown here). However, I am not sure how to interpret the differences.

Please see the answer above. It is rather complicated and leads me to my most important
advice: always explicitly construct the variables that you want in your model and then explicitly
include them. It makes your life so much easier in the end even if it means that you have to
spend tons of time generating variables!

Subject: Summary on Logistic Regression


Participant’s Post:
I want to summarize what I think we are doing with logistic regression and see if I am thinking of
this correctly.

- We want to model a process that is binary, so we use the Bernoulli distribution.
- We REALLY like to use linear models because we have a massive (and convenient)
  mathematical framework for using them.
- We have a problem because standard probabilities range from 0 to 1, but our linear
  model will range (in theory) from -∞ to ∞.
- To match our linear range to the probability range, we transform our probability (µ) value to
  an odds. Transforming µ into odds means the left-hand side (LHS) of our equation now
  has a range from 0 to ∞.
- We now take the logarithm of the LHS, which transforms its range to -∞ to ∞.
- Our right-side range now maps to our left-side range. However, the relationship is
  nonlinear.

Is this summary correct?

The nonlinear mapping is what bothers me. Could you give me some insight into how things like
confidence intervals for µ on the LHS of our model are related to linear model values on the
RHS? If you could point me to a reference on how the algorithms work, that would be enough. I
know this is kind of an open-ended question, but I am just curious how it all works.
Instructor’s Response:
Here is how I would write the summary:
We want to estimate the conditional mean of an outcome given covariates, and that conditional
mean is always in (0,1), and the outcome is always zero or one. So, we use the Bernoulli
because that distribution has these properties.
Rather than trying to specify a really complicated relationship between covariates (which is
another viable alternative), I will simply transform the linear predictor (X*beta=η) to the
restricted range of the conditional means I am trying to estimate. That is, my linear predictor
(X*beta or η) will be assumed to be a real number, but the transformation of that linear
predictor g-1(X*beta) will have the same restricted range as the conditional mean I am trying to
estimate. I will call that transformation the inverse link function because it links the linear
predictor to the conditional mean (μ); equivalently, I will call g(μ) = X*beta=η the link function.
The linear predictor is linear meaning that each covariate can separately affect the conditional
mean, but the relationship of the covariates to the conditional mean will be nonlinear (forcing
me to creatively find ways to interpret separate coefficients).
I know how to get predicted values, and I know from basic statistics how to get standard errors
of predicted values. Thus, I even know how to get confidence intervals for predicted values. I
can use plug-in style estimates for confidence interval estimates of the inverse link function of
the linear predictors (using these estimates is justified in statistics by the so-called delta
method). Basically, the delta method says this:
Var(g(X)) ≈ g'(X)^T Var(X) g'(X)
This mathematical equation says that the variance of a (possibly nonlinear) transformation of a
random variable is equal to the derivative of the function * the variance of the random variable
* the derivative of the function. I wrote it this way (in matrix notation) just in case X was more
than a scalar (as is the case in multivariable models). Note in the above, that sometimes people
put the transposed version on the right and the untransposed version on the left. Whatever
representation keeps the formula conformable is correct.
So, when you see coefficients and standard errors for your logistic regression and then want to
see odds ratios and standard errors of the odds ratios, those last standard errors are calculated
like this:
Var(OR(beta)) = Var(exp(beta)) = exp(beta) * Var(beta) * exp(beta)
which means that
StdErr(OR(beta)) = StdErr(exp(beta)) = exp(beta) * StdErr(beta) = OR(beta) * StdErr(beta).
That is, the standard error of the odds ratio is the standard error of the coefficient times the
odds ratio.
Another Participant’s Question:
One thing I am still confused about is the use of the Bernoulli distribution.
Before I took this course I assumed Logistic Regression was based on the Logistic Distribution.
What role, if any, does the Logistic Distribution play in Logistic Regression?
Why is it called "Logistic" regression? Why not Bernoulli Regression!!!
Instructor’s Response:
For two reasons, really. The first is that the inverse of the logit link is known as the logistic
function. It's not the only function we can use to transform the linear predictor to the (0,1)
interval. Another popular choice is based on the standard normal CDF, sometimes called the
normit or probit. Thus, if we use the Bernoulli distribution, but with the inverse CDF (quantile)
function of the standard normal as the link, then we call it probit regression. Two more functions are the
log-log function and the complementary log-log function. Both link functions are again used
with the Bernoulli distribution, and are called log-log regression and complementary log-log (or
c-log-log) regression. Thus, we don't call logistic regression Bernoulli regression because the
Bernoulli distribution is coupled with lots of different link functions to create a regression
model. So, we use the link name instead of the distribution name (at least in these cases).
A second reason is that there is an alternative derivation of a regression model. Imagine that
you posit
y = b_0 + b_1*x_1 + ... + b_p*x_p + epsilon

where epsilon is distributed according to the logistic distribution. Now, also imagine that when
the data are collected, we don't actually collect y; it is latent. The outcome y is unknown, or at
least, mostly unknown. What I do know is whether the outcome is less than some particular
limit value. If the outcome is less than that limit, I record a 0. If the outcome is at least as large
as that limit, I record a 1. That is, I record

y-recorded = 0 if b_0 + b_1*x_1 + ... + b_p*x_p + epsilon <= CUTOFF
y-recorded = 1 if b_0 + b_1*x_1 + ... + b_p*x_p + epsilon > CUTOFF

and I assume that epsilon is an error that is distributed Logistic(0,1). Now, let's work out how we
will estimate this, assuming that CUTOFF=0:

P(y-recorded = 1 | x) = P(y > 0 | x) = P(b_0 + b_1*x_1 + ... + b_p*x_p + epsilon > 0)
                      = P(epsilon > -(b_0 + b_1*x_1 + ... + b_p*x_p))
                      = P(epsilon < b_0 + b_1*x_1 + ... + b_p*x_p)    because Logistic(0,1) is symmetric about 0
                      = invlogit(b_0 + b_1*x_1 + ... + b_p*x_p) = 1/(1+exp(-(b_0 + b_1*x_1 + ... + b_p*x_p)))

which is the same probability we obtained before from the logit link.


Thus, the model is called logistic regression because (1) it uses the logit/logistic link function in a
Bernoulli regression model, and (2) because it can be derived assuming a latent outcome
variable with errors that follow the logistic distribution.
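The latent-variable derivation is easy to check by simulation; in this sketch the true intercept and slope (0.5 and 1.0) are made-up values, and the latent outcome is discarded exactly as described above.

# Latent outcome with Logistic(0,1) errors; we record only whether it exceeds 0
set.seed(4)
n  <- 5000
x1 <- rnorm(n)
ystar <- 0.5 + 1.0 * x1 + rlogis(n)   # latent, never observed
y  <- as.numeric(ystar > 0)           # what we actually record

coef(glm(y ~ x1, family = binomial))  # estimates close to the true 0.5 and 1.0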

Subject: Questions
Participant’s Post:
Questions (PGLR book)
Page 65: The bootse values appear to be identical to the se values. Looking into the
bootstrapped values of bse$t, these are identical across the 100 iterations, and my
understanding of the code in the function is that I will simply reproduce the bootmod model. Is
there something that should be varied in each iteration of the function?
Page 72-3 LR test and drop1() function. It might seem obvious, but I'm assuming that one
should consider dropping predictors where models with predictor dropped are insignificantly
different to models with the predictor. (It might be worth saying this in the text).
Page 74: Anscombe residuals: The a and b parameters for the beta distribution are both set to
2/3. Would this always be appropriate?
Page 85: First line. Is the ROC statistic the same as the AUC statistic which is described later on
the page?
Instructor’s Response:
Page 65: The bootse values appear to be identical to the se values. Looking into the
bootstrapped values of bse$t, these are identical across the 100 iterations, and my
understanding of the code in the function is that I will simply reproduce the bootmod model. Is
there something that should be varied in each iteration of the function?
The code is incorrect on page 64. The bootstrap procedure calls the t() function
repeatedly. Each time it calls the function, it passes the original data as the first argument
(medpar) and it passes the indices that were randomly selected in the second argument. The t()
function correctly grabs the randomly selected rows from the passed argument (x) and places
those into xx, but then refers to medpar instead of xx in the glm() call. Here is the change (note
the very last specification of the data):

bsglm <- glm(died ~ white + hmo + los + factor(type), family=binomial,
data=xx)

Making this correction will cause the bootstrap to honor the randomly selected rows in each
iteration so that there will be randomness. Without this change, it simply estimates the original
model over and over. You correctly diagnosed exactly what was wrong with the program!
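Putting the corrected pieces together, a sketch of the whole bootstrap with the boot package might look like this. I have renamed the statistic function boot_coef to avoid clashing with base R's t(), and it assumes the medpar data from the book are loaded; treat the details as assumptions if your copy differs.

library(boot)

# Statistic function: refit the model on the bootstrap sample xx, not on medpar
boot_coef <- function(x, i) {
  xx <- x[i, ]                                   # honor the bootstrap indices
  bsglm <- glm(died ~ white + hmo + los + factor(type),
               family = binomial, data = xx)     # note data = xx
  coef(bsglm)
}

bse <- boot(medpar, boot_coef, R = 100)   # 100 bootstrap replications
apply(bse$t, 2, sd)                       # bootstrap standard errors of the coefficients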

Page 72-3 LR test and drop1() function. It might seem obvious, but I'm assuming that one
should consider dropping predictors where models with predictor dropped are insignificantly
different to models with the predictor. (It might be worth saying this in the text).

Depends on who teaches the class! Often, classes are taught in which certain rules of
model construction are fairly rigid. One such rule is deleting variables that are not
significant. Modern analysts don't do that. Variables that are theoretically supposed to be
there, or are historically there, are left in the model no matter what.

Page 74: Anscombe residuals: The a and b parameters for the beta distribution are both set to
2/3. Would this always be appropriate?

The formula for the Anscombe residuals is usually written in terms of the Hypergeometric2F1
function. It is a fairly complicated looking result from a definition of residuals. The definition
was offered from arguments that might lead to residuals that more closely followed the normal
distribution. In any event, the formula given in the book is in terms of the beta function and the
incomplete beta function, and not the beta distribution. It is very rare that someone actually
uses the Anscombe residuals.

Page 85: First line. Is the ROC statistic the same as the AUC statistic which is described later on
the page?

Yes, exactly! The ROC is the receiver operating characteristic curve, and the area under that curve (AUC) is its principal statistic. Various people refer to it as either the ROC statistic or the AUC statistic.

Subject: Over-fitted models


Participant’s Question: What does over-fitting actually mean, and why is it a problem?
Instructor’s Response:
Overfitting is a problem that gets little notice, but it really can be a problem. Let's make an analogy --- a map. A model is like a city or state road map. It gives a layout of the data so that you get an understanding of it, or an overview. You can make predictions from it; e.g. follow this road, turn on that highway, go 30 miles to scale, and you end up at Nowhere, AZ. It gives a guide for looking at relationships and predictions for a given set of data. But let's say that you are asked to draw a more detailed map of area A. You can add so many qualifications, variables, and so forth that the map becomes too unwieldy. There is too much information. A model is like that in a way. The more predictors I add to the model to fine-tune it, the more unwieldy it gets, and the more likely it is that variables will be correlated with one another. This is a no-no in regression. Model overfitting means that you have too many predictors, or that you have qualified them in too fine a way. The problem is that the model you develop may be great for the exact situation you are modeling, but you cannot extrapolate it to any other situation or time. The goal of modeling is to unbiasedly estimate the true parameters of the PDF underlying the model you are using to understand the data, and with which you aim to make predictions. We want our model to give us information about data situations similar to the one you modeled.
A model should be a bit nebulous, e.g. a bit fuzzy-edged. There will always be uncertainty in data collection and interpretation. And what exactly is statistical significance? Believe me, the traditional 0.05 is -- or should be -- only a guide to affirming that a predictor adds to our understanding of the response variable. And at the very base of modeling there is the assumption that the data we are modeling are theoretically generated from the unknown parameters of a valid probability distribution. This means that all observations in the model must be independent. Well, that just does not happen much with real data. But if you try to fine-tune too much, you develop something that is pretty much useless.

Subject: ANOVA with logistic regression


Participant’s Question:
In linear regression, one can determine if addition or removal of a predictor or interaction
between predictors leads to a significant change in the outcome of the model by doing an
ANOVA between the two models
fit1 <- lm(y ~ a + b + c)
fit2 <- lm(y ~ a + b + c + d)
anova(fit1, fit2)

if significant then predictor change or interaction truly changes model.


Is it appropriate to use ANOVA to test the difference between two logistic regression models (ie
model one with x predictors and model two with the same x predictors + one more predictor) to
see if addition of the predictor to the second model changed the model significantly?
Wouldn't it be more appropriate to compare the AIC values of both logistic regressions?
Instructor’s Response:
ANOVA, as a Gaussian-based model, is not normally used with logistic regression. It is sometimes used by researchers, but by far the most recommended way to handle testing nested logistic models is through the likelihood ratio test, which is based on the Chi2 distribution. The drop1 function in R is pretty much the standard way of testing the worth of a predictor (or predictors) in a logistic model. In Stata it is the lrtest command.
It is easier to check predictor worth by simply checking the p-value of a predictor's z statistic - which is many times referred to as a Wald test. But it is not as accurate as the likelihood ratio test described above. The results are usually the same, but not always. I use a combination of both, with the LR test and contextual knowledge being the final determination.
I use the AIC and an enhancement of the traditional AIC for comparing models. However, much has been written on the limitations of the standard AIC, and there are a variety of enhanced versions that have become somewhat popular in the literature.
Look at the definitions of the LR test and the traditional AIC. With s = the smaller (nested) model and f = the larger (full) model, the LR statistic is

-2(LLs - LLf) = -2LLs + 2LLf

With p being the number of predictors or parameters in a model, the AIC is defined as

-2LL + 2p

When comparing two models, with a = the model with pa parameters and b = the model with pb parameters, the difference in AIC values is

(-2LLa + 2pa) - (-2LLb + 2pb)

= -2LLa + 2LLb + 2pa - 2pb

= -2(LLa - LLb) + 2(pa - pb)

= -2(LLa - LLb) + 2

when models differing by a single predictor are being compared (pa - pb = 1). So the LR statistic and the AIC difference are basically the same when comparing two nested models; the AIC can also compare non-nested models, for which the LR test should not be used.
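As a concrete sketch of these options in R (hypothetical variable names y, a, b, c, d in a data frame dat):

fit1 <- glm(y ~ a + b + c,     family = binomial, data = dat)
fit2 <- glm(y ~ a + b + c + d, family = binomial, data = dat)

anova(fit1, fit2, test = "Chisq")    # likelihood ratio test of the added predictor d
drop1(fit2, test = "Chisq")          # LR test for dropping each term from fit2
AIC(fit1, fit2)                      # AIC comparison; also usable for non-nested models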

Subject: Linearity Test


Participant’s Question:
In Linearity Test say that the "smooths are largely artifacts of the data and may not represent
factual differences in the underlying slope". The regression table shows slopes for age23 of 1.13
and age 4 of 2.19. I interpret that as there being significant differences in the slopes. Could you
please clarify what is wrong with my interpretation?
Instructor’s Response:
Well, if you have worked with splines, it’s easy to change the appearance of the spline or smooth by giving it different degrees of freedom. But if you can hone in on a smooth or spline that does reflect the shape of the response variable, you can observe whether there are changes in the direction of the smooth as it progresses from left to right across the graph. Changes in the angles of the smooth reflect changing slopes, which are the coefficients.
Recall that a continuous predictor has a single coefficient (slope), which assumes that the slope is constant across the range of the predictor. If there is evidence of a change in slopes, then it is important to factor the variable in accordance with the slopes underlying the distribution of the data. Partial residual plots, generalized additive models, and fractional polynomials help provide the researcher with the tools to evaluate the underlying shape of the data.

Subject: P-values
Participant’s Question:
Line starting...
"Probability value (p-value) criteria for keeping predictors...range from 0.1 to 0.5...
"... We shall retain predictors having parameter p-values 0.25 or less."
Is this correct or should it be "0.01 to 0.05" and "0.025"?
Instructor’s Response:
In that section of the book we are discussing model construction, i.e. what we can do to determine possible predictors for inclusion in a final model. The p-value refers to using ulogit, as I recall. At this stage you want to be soft in allowing variables to be retained. The reason: they may end up in an interaction. Moreover, when a predictor is in a model with other predictors, its p-value may change substantially from when it is in only a univariate relationship with the response. But I have found that 0.25 is a good compromise, in that most variables above that criterion will not significantly contribute to the model even with other predictors. Some authors suggest an even higher criterion.

Subject: Coefficient Standard errors and profile likelihood


Participant’s Question:



In R, when you request the confidence interval for the coefficients with the function confint(), it
produces the "profile likelihood" confidence interval. It doesn't look this method is covered in
Chapter 5 (reference Logistic Regression Models, Joseph M. Hilbe, 2009). Is there an accessible
reference to describe how this works?
Continuing the analysis of the heart01 model that was used in the bootstrapping standard error
example:
death ~ anterior + hcabg + killip + agegrp
If I get the confidence interval for this model:
confint(heart01.ahk2.glm)
Waiting for profiling to be done...
                        2.5 %      97.5 %
(Intercept)       -5.21001690  -4.4508278
anterioranterior   0.31365009   0.9715998
hcabgCABG          0.03642966   1.4329412
killip2            0.46741346   1.1761202
killip3            0.24534486   1.3054987
killip4            1.96774918   3.3719984
agegrp23           0.87729078   1.6656686
agegrp24           1.53588344   2.3535143

Notice that here hcabgCABG is just significant (zero isn't in the confidence interval). Notice also
that the confidence interval isn't symmetric, if we find the distance from the interval endpoints
from the point estimate:
abs(confint(heart01.ahk2.glm, 3) - coef(heart01.ahk2.glm)[3])
0.7499889 0.6465227
and just like with the bootstrap confidence interval, the lower end of the interval is farther from
the point estimate than the upper end (the distribution is left skewed).
Any pointers to better understand what profile likelihood confidence interval
estimates mean would be greatly helpful.
Instructor’s Response:
The profile likelihood based confidence intervals (CIs) are indeed different from the standard way we calculate CIs: coef +/- 1.96*SE. Profile CIs are based on the profile likelihood: the coefficient of interest is fixed at a range of values, the remaining parameters are re-estimated at each of those values, and the interval endpoints are the values at which the resulting log-likelihood falls a chi-square-based cutoff below its maximum.
Using the standard method for calculating CIs, the 95% CI for hcabg is .0952627, 1.478059. Compare these values with what you showed when using confint(): 0.03642966, 1.4329412. Here is a user-written command in Stata for a profile likelihood and CI, called logprof. I tried it on hcabg and got the following results, plus a nice graphic. Do the values seem familiar?

. logprof hcabg, level(95)


Maximum Profile Log-Likelihood = -637.90293
-------------------------------------------------------------------
       death |      Coef.      [95% Conf. Interval]
-------------+-----------------------------------------------------
       hcabg |  .78666075      .03674628    1.4332469
-------------------------------------------------------------------
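The same comparison can be sketched directly in R, assuming the fitted object heart01.ahk2.glm from your question (the coefficient name is taken from your confint() output):

b  <- coef(heart01.ahk2.glm)["hcabgCABG"]
se <- sqrt(diag(vcov(heart01.ahk2.glm)))["hcabgCABG"]
b + c(-1, 1) * qnorm(0.975) * se            # Wald interval: symmetric about the estimate
confint(heart01.ahk2.glm, "hcabgCABG")      # profile likelihood interval: asymmetric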



Subject: Likelihood ratio statistic
Participant’s Question:
the Log Likelihood ratio test. Why is it a much-preferred method of assessing whether a
predictor or group of predictors contributes to a model?
Instructor’s Response:
The likelihood ratio test is a test of the difference in the log-likelihood values of the two models you are comparing. Specifically, it is -2(LL:ReducedModel - LL:FullModel).
See pages 81-82 and 248-249. The difference is Chi2 distributed, with degrees of freedom equal to the number of predictors dropped from the full model to obtain the reduced model. Usually it is 1, particularly if you are testing whether a single predictor needs to be in a model.
There is also an LR test reported in the Stata output for the logit command. It is a comparison of the intercept-only model (reduced) with the full model (with all of the predictors you want in the model).
The LR test is more accurate than what is called the Wald test, i.e. the test based on the coefficient p-values. If a p-value appears to be close to 0.05, I always suggest using a likelihood ratio test.
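A hand computation in R, as a sketch, assuming two nested fitted models named reduced and full:

lr <- -2 * (as.numeric(logLik(reduced)) - as.numeric(logLik(full)))
df <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
pchisq(lr, df, lower.tail = FALSE)    # p-value of the likelihood ratio test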

Subject: Questions on Model construction


Participant’s Question:
1. In general do you recommend always making indicator variables using the tab, gen()
commands rather than xi: with i.x. It seems to give more flexibility. Are there any situations
when it is not preferred?
2. I could not quite tease out the reason the significance of the interaction term of the Box-
Tidwell test indicates nonlinearity of the logit with the right hand terms.
3. The fractional polynomials graph seems very similar to the second of the R graphs (GAM
smooth of adjusted age). I was able to duplicate it in R - not hassle free. In your view is there
added benefit to doing the R analysis over the fractional polynomial (ie, is it worth the hassle). If
there is an advantage, when specifically should we consider it?
4. From section 5.3 (reference Logistic Regression Models, Joseph M. Hilbe, 2009), although using standardized coefficients gives good information with different metrics of variables, aren't they still dependent on our picking the correct units for the variables (e.g. months vs years, etc.)?
5. I understand the issues with study design and OR versus RR etc, but I was wondering if there
is any situation when we should leave the logistic output as odds/OR and not extend to
probabilities? Is it always OK to describe logistic regression output as probabilities? The reason I
am asking is that, conceptually to me, a probability of an outcome seems somewhat linked to a
risk of an outcome happening?
Instructor’s Response:
1: I almost always use the tab, gen() to create dummy variables. I can call them what I want, and
I can later combine levels if needed without messing up the system that checks i_*. I use xi:
when I am checking the creation of my own interactions, particularly when multiple levels are
concerned. However, I like to have control over making interaction terms. You can see how this
can be important when interpreting interactions.
2: I have never really thought about why the interaction of the continuous and log(continuous) predictor, when added to the model, indicates linearity if it is not significant. I checked other books that describe the method, and none of them say why this is the case either. I would think, offhand, that it relates to the situation in which a continuous predictor, call it X, is not significant due to its nonlinear shape. Many times logging X and entering it in the model shows ln(X) to be significant. This is a transformation of X used to effect linearity. It shows, however, that the model with X is not linear. Multiplying X and ln(X) tends to dampen the effect of ln(X) in the model. But if the product term is significant, then it still shows that there is nonlinearity in the model without any predictors being transformed.
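As a rough sketch of that check in R (illustrative only, reusing the medpar variables from earlier examples, with los as the continuous predictor being tested):

bt <- glm(died ~ white + hmo + los + I(los * log(los)),
          family = binomial, data = medpar)
summary(bt)$coefficients["I(los * log(los))", ]    # a significant term suggests nonlinearity in the logit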
3: I don't think that there is an advantage, unless you are proficient in R and want to use it.
Fractional polynomials were first developed by Royston and colleagues at Imperial College in
London. Royston is a longtime Stata user, and first constructed fractional polynomials in Stata.
Later they were written into R code. From what I recall, further advanced fractional polynomials
have been developed in Stata, but are not yet in R. I have not checked lately though. It's largely
a matter of preference.
4: Standardization is metric free, and converts all predictors into standard deviation units. I
believe that I discuss some downsides though. Note that I show the standard manner of
constructing logistic standardized coefficients. Some authors have thought that they should be
constructed differently. When checking different ways when writing the text, I did not see any
appreciable difference in the resulting standardized values.
Hence I just showed the standard method. You may want to google it though and see if anything
new has come up on the matter. I never use them for my research work.
5: Unless the incidence rate is low, odds ratios should not be interpreted as risk ratios, which are
in fact interpreted probabilistically. There has been quite a bit of discussion on this subject. One
can calculate logistic probabilities by employing the inverse link to the linear predictor,
1/(1+exp(- xb)) or exp(xb)/(1+exp(xb)). Then, for each observation, you have the probability of
success. But these are not coefficients, which when exponentiated produce odds ratios.
Researchers generally keep the two analyses apart. On the one hand one can use a logistic
model to learn the odds ratios of the various predictors, or one can determine the probabilities
of each observation. Calculating probabilities also allows one to determine the probability of y
based on a particular covariate pattern you construct, as long as the pattern is within the scope
of the estimated model. This is really very useful information though, and, if the model is well
fitted, can give us important information about the world.
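In R, for example, a sketch with a fitted binomial glm named fit and new covariate values in newdat (names are illustrative):

eta <- predict(fit, newdata = newdat, type = "link")     # linear predictor xb
p   <- 1 / (1 + exp(-eta))                               # inverse logit applied by hand
all.equal(as.numeric(p),
          as.numeric(predict(fit, newdata = newdat, type = "response")))   # same thing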

Subject: Fracpoly and Stata 11


Participant’s Question:
On p. 97 (reference Logistic Regression Models, Joseph M. Hilbe, 2009) of the book, the
following command is given: fracpoly logit death age kk2-kk4, degree(4) compare. I tried it in
Stata 11 and it just keeps on running (for about an hour so far now) with no result given. Using
degree(3) instead does work and the results are virtually immediate. I am using the same
dataset as used in the text for this example (heart01.dta). Am I doing something wrong, or have the command or its algorithms changed since the book was published so that it is now unable to converge (at least on the dataset provided in the text)? I am using Stata 11 on a Mac.
Instructor’s Response:
I just received word back from Wes, who is a senior statistician at Stata Corp, about the fracpoly
command problem. He told me that the -logit- command in Stata 11 uses a different optimizer
than the one for versions 9 and 10. This resulted in the problems we both experienced. Stata
version 11.2 came out Wednesday late afternoon and can be downloaded for free. Wes ran it on
his copy and it completed after about 35 minutes. He said, "My results were nearly the same as
those in your book. (My standard errors differed, but only in about the fifth decimal place.)"



He then told me that some of the constituent fracpoly models he tried resulted in missing
standard errors, or coefficients that blow up. This is the case with both versions 11.2 and earlier.
He said that the negative powers and the square roots seem most troublesome. However, he
concluded by saying that "I achieved fast convergence in Stata 11.2 with the -powers()- option:
powers(0, 1, 2, 3) The best powers of the final model turned out to be 3-3-3-3."
So, use fracpoly with care, and try using the powers() option. I like quantile regression for
checking the appearance of an empirical distribution. But fractional polynomials are popular with a number of researchers. So I thought it warranted discussion.

WEEK 3:

Subject: Reference level in variable


Participant’s Question:
When we add a variable that has different levels, one level (class, type, etc) becomes the
reference. That means we are not examining its contribution to the response, yes? We can
change the levels so a different reference is assigned and rerun the model. But, as I just
discovered, a level indicated as significant in one model may not be significant in the model with
a reassigned reference.
If I wanted to know the contribution of all levels of a variable, how does one do that? And, if the different models show differences in the significance of a level depending on the reference, how does one work out the actual odds ratio and p-value for each level?
Instructor’s Response:
It can be confusing, but all of the levels are always recoverable from the model. What the
specific coefficients represent (in terms of a statistical hypothesis test) differs for each
specification. Let's take a linear regression specification with a simple example of a 3-level
variable. Imagine that we create 3 indicator variables for the three levels of that variable:
Level1, Level2, and Level3.

Model 1: y = a0 + a1*Level2 + a2*Level3 Level 1 = reference

Model 2: y = b0 + b1*Level1 + b2*Level3 Level 2 = reference

Model 3: y = c0 + c1*Level1 + c2*Level2 Level 3 = reference

Predicted means and coefficient tests

              Level 1      Level 2      Level 3
Model 1:      a0           a0+a1        a0+a2
Model 2:      b0+b1        b0           b0+b2
Model 3:      c0+c1        c0+c2        c0

1.1) The test of a1 is a test of whether Mean of Level 1 = Mean of Level 2



1.2) The test of a2 is a test of whether Mean of Level 1 = Mean of Level 3

1.3) No direct comparison of Level 2 versus Level 3

2.1) The test of b1 is a test of whether Mean of Level 1 = Mean of Level 2

2.2) The test of b2 is a test of whether Mean of Level 2 = Mean of Level 3

2.3) No direct comparison of Level 1 versus Level 3

3.1) The test of c1 is a test of whether Mean of Level 1 = Mean of Level 3

3.2) The test of c2 is a test of whether Mean of Level 2 = Mean of Level 3

3.3) No direct comparison of Level 1 versus Level 2

Now suppose, for example, that in truth:

Level 1 not statistically different from Level 2

Level 2 statistically different from Level 3

Level 1 not statistically different from Level 3

What will each model say in their tests of coefficients?

Model 1: a1 = not significant; a2 = not significant

Model 2: b1 = not significant; b2 = significant

Model 3: c1 = not significant; c2 = significant

So, the results can be different. Even though the models don't directly perform the final
comparison, you can get those comparisons by specifying a post-hoc Wald test. Once you add in
the final comparison with a post-hoc analysis, then all models report identical results.

In the above, if you want to turn things into odds ratios, work from the linear predictor with the indicator variables set to one or zero for the specific levels you want: the inverse logit gives each level's probability, and from those probabilities you can form the odds and odds ratios.
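A small R sketch of the releveling idea, assuming the medpar data, whose type variable has three levels (the names are illustrative):

medpar$type <- factor(medpar$type)
m_ref1 <- glm(died ~ type, family = binomial, data = medpar)                      # level 1 = reference
m_ref3 <- glm(died ~ relevel(type, ref = "3"), family = binomial, data = medpar)  # level 3 = reference
summary(m_ref1)$coefficients    # prints tests of level 2 vs 1 and level 3 vs 1
summary(m_ref3)$coefficients    # prints tests of level 1 vs 3 and level 2 vs 3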

Subject: Log-Likelihood formation


Participant’s Question:
In the text (p. 5), there is discussion about the benefits of using the log likelihood (e.g., "this makes it much easier for the algorithms used to estimate distribution parameters to converge; that is, to solve the estimates").
I have run across something similar in many publications but always wondered whether or not
this statement is reflective of older technology and the past. Is this still relevant with modern
solvers like what one would find in Excel or OpenSolver? Maybe another way to ask this is



whether you would expect to see performance (speed) differences depending on whether log
likelihood was used or not? Do you always use log likelihood formulation?
Instructor’s Response:
Imagine a very simple problem of having 4 independent observations (y1, y2, y3, and y4) from a
Bernoulli population characterized by a parameter p which has the following properties:
P(Y = 0) = 1-p
P(Y=1) = p
which implies that
P(Y=y) = (1-p)^(1-y) * p^(y)
Mean = E(y) = (1-p)(0) + (p)(1) = p
Variance = E(y^2) - E(y)^2 = (1-p)(0)(0) + (p)(1)(1) - p*p = p - p^2 = p(1-p)
To emphasize, distributions deal with probabilities of unknown data (y) for given/known
parameter values (p).
To estimate p, we consider the joint probability of all of the observations:

P(Y1=y1 and Y2=y2 and Y3=y3 and Y4=y4) = P(Y1=y1) * P(Y2=y2) * P(Y3=y3) *
P(Y4=y4)

because the four observations are independent (Y1, Y2, Y3, and Y4 all have the same Bernoulli
distribution - a distribution with the same value of p). Without the assumption of
independence, the joint distribution would have to consider covariance among the four
observations, so this is a really powerful assumption. The joint distribution can be written:

Likelihood of (p given y1,y2,y3,y4) = P(y1,y2,y3,y4)

= [(1-p)^(1-y1) * (p)^(y1)] * [(1-p)^(1-y2) * (p)^(y2)] * [(1-p)^(1-y3) * (p)^(y3)] * [(1-p)^(1-y4) * (p)^(y4)]

= product(i=1 to 4) { (1-p)^(1-yi) * p^(yi) }

The maximum likelihood estimate of p is that value of p that maximizes this joint
probability. That is, we treat the data (y1,y2,y3,y4) as known and treat the parameter as
unknown. Instead of calling it the joint probability, we call it the likelihood - we change it's
name to emphasize that we have changed what we assume is known and unknown.

This is a really easy problem, so we could just draw a picture: Imagine that (y1,y2,y3,y4) =
(1,0,1,0). In that case, what is the MLE of p? Let's just assume that we have no idea, but we do
know that for the Bernoulli distribution, 0<p<1. So, let's just draw a picture for these values.

In R:

y <- c(1,0,1,0)
p <- seq(0.01,0.99,len=99)
likelihood <- ((1-p)^(1-y[1]) * (p)^(y[1])) * ((1-p)^(1-y[2]) *
(p)^(y[2])) * ((1-p)^(1-y[3]) * (p)^(y[3])) * ((1-p)^(1-y[4]) *
(p)^(y[4]))
plot(p,likelihood)

What is the array likelihood? It is the probability that (y1,y2,y3,y4)=(1,0,1,0) for p=0.01, p=0.02,
..., p=0.99. There is a best value for p in the sense that it makes the joint probability as big as

possible. That is our goal. To find that value of the parameter that "best explains" the values
we have in our data. We can see in the graph that the best value of p is around p=0.50.

Is there a way that we could find the MLE of p? Something more formal than simply calculating
the likelihood for every conceivable value and looking at the picture? The answer is yes. In
mathematics, this is called a maximization problem or an optimization problem. Now, the
maximum of a function is also the maximum of the log of that function. Not too sure of that?

loglikelihood <- log(likelihood)


plot(p,loglikelihood)

Generally speaking, it is easier to deal with the log of the likelihood because the log of a product
is the sum of the logs: that is, log(a*b) = log(a) + log(b). That means that the log-likelihood from
the above likelihood function can be written:

Log likelihood of (p given y1,y2,y3,y4) = log [ product(i=1 to 4) { (1-p)^(1-yi) * p^(yi) } ]

= sum(i=1 to 4) log{ (1-p)^(1-yi) * p^(yi) }

= sum(i=1 to 4) { (1-yi)log(1-p) + (yi)log(p) }

Now comes more math. To find the maximum of this function, we know that the function is
concave; that is, there is a maximum. On either side of that maximum, the function
decreases. If we were to draw a tangent line at each point on that function, the tangent line
would be horizontal (have slope=0) at the maximum and that is the only point at which the
tangent line would look like that. Equivalently, that means that the derivative of the function is equal to zero at that point. So, we find whatever value of p makes the derivative of the function equal to zero.

Derivative = sum(i=1 to 4) { -(1-yi)/(1-p) + (yi)/p }

Let's set to zero and solve for p:

sum(i=1 to 4) { -(1-yi)/(1-p) + (yi)/p } = 0

y1/p + y2/p + y3/p + y4/p = (1-y1)/(1-p) + (1-y2)/(1-p) + (1-y3)/(1-p) + (1-y4)/(1-p)

y1(1-p) + y2(1-p) + y3(1-p) + y4(1-p) = 4p - y1(p) - y2(p) - y3(p) - y4(p)

(y1+y2+y3+y4) - (y1+y2+y3+y4)(p) = 4p - (y1+y2+y3+y4)(p)

(y1+y2+y3+y4) = 4p

(y1+y2+y3+y4)/4 = p

average of the y values = p (= 0.50 in our particular case)



So, the MLE of p is the average of the observed values = (2/4) = 0.50. This is the same answer as
was seen in the plot, but done more rigorously. Now, in many cases, there is more than one
parameter. In those cases, we have to take a (partial) derivative with respect to each of those
parameters, set them all to zero and simultaneously solve those partial derivatives (set to zero).

Returning to our single-parameter example, we could also work directly with the likelihood itself:

Likelihood = product(i=1 to 4) { (1-p)^(1-yi) * p^(yi) }

= [(1-p)^(1-y1) * (p)^(y1)] * [(1-p)^(1-y2) * (p)^(y2)] * [(1-p)^(1-y3) * (p)^(y3)] * [(1-p)^(1-y4) * (p)^(y4)]

= (1-p)^(4-y1-y2-y3-y4) * (p)^(y1+y2+y3+y4)

= (1-p)^2 * p^2        (since y1+y2+y3+y4 = 2 for our data)

= p^4 - 2*p^3 + p^2

Derivative:

4p^3 - 6p^2 + 2p = p * [ 4p^2 - 6p + 2]

Set to zero has a trivial solution at p=0 which is not interesting since we need solutions 0 < p < 1

So, let's solve the quadratic 4p^2 - 6p + 2 = 0 which implies that

p = (6 +- 2)/8

which implies that

p=1/2 and p = 1

p=1 is not interesting since we need 0 < p < 1. Thus, the only interesting solution is p = 0.50

To emphasize, we did not have to take the log of the likelihood. We could have taken the
derivative of the product, set it to zero, and solved. However, the function is much more
difficult to work with. So, we don't take log because it is old-fashioned. We do it because it
makes these problems MUCH easier.
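A quick numerical check of the same algebra (same toy data as above; optimize() maximizes the Bernoulli log-likelihood over 0 < p < 1):

y <- c(1, 0, 1, 0)
loglik <- function(p) sum(dbinom(y, size = 1, prob = p, log = TRUE))
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum    # approximately 0.50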

Subject: Over-Dispersion and Under-Dispersion


Participant’s Question:
The textbook lists possible reasons for apparent over-dispersion, but what would be some of the
reasons for under-dispersion?
I'm just guessing here, but what about the condition where one of the predictors/covariates, as we amateurs say, 'leaks' into the dependent or target?
And when under-dispersion occurs naturally and is not just apparent, isn't that a good thing if you want a usable model?
Instructor’s Response:



What you want from a parametric model is a specification that closely matches the data without
overfitting. So, you want an underlying distribution that models the conditional means, and
closely captures the conditional variances. The means tell us about various associations, and the
variances allow us to formally test those associations. For statistical tests to carry the meaning
we ascribe to them (level of significance and any calculation involving probability), the
conditional variance needs to be correct.
UNDER DISPERSION: If the conditional variance assumed by the model is too large (the variance of the data is smaller than what the model assumes), the standard errors are too large, so the p-values are too large and the confidence intervals are too wide (we miss identifying significant associations; we lose statistical power; our Type II error rate is inflated).
OVER DISPERSION: If the conditional variance assumed by the model is too small (the variance of the data is larger than what the model assumes), the standard errors are too small, so the p-values are too small and the confidence intervals are too narrow (we over-identify significant associations; we exaggerate statistical power; our Type I error rate is inflated).
Whether Type I error or Type II error is worse can only be answered in context, so it really isn't
possible to decry one as worse than the other in a universal sense.
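A quick way to gauge dispersion relative to a chosen model in R, as a sketch (fit stands for any fitted binomial or Poisson glm):

pearson_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
pearson_disp    # well above 1: overdispersed relative to the model; well below 1: underdispersed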
Another Participant’s Question:
I am trying to understand when underdispersion occurs. Since we are sampling, we should
expect a mix of under- and overdispersed data sets. However, I occasionally see vendors who
lose control of their processes and end up screening their components for shipment to their
customers. Generally, component parametrics are well-modeled using a normal distribution.
However, when vendors screen their parts they remove the tails of the distribution. Would this
also be considered an underdispersed situation?
Instructor Continued:
Before I address your specific question, I just want to re-emphasize some points because I get
the impression from the book and from some of the earlier discussion that some things aren't
clear.
Point 1: data are what they are. They are not over dispersed, they are not equi-dispersed, and they are not under dispersed.
Point 2: as an analyst, we choose a distribution upon which to model data
It is Point 2 that ultimately yields a statement about over/under/equi-dispersion. Data have
their conditional variance and that conditional variance relates to the conditional variance of the
model (that we chose). So, I just want to emphasize that there is no "fault" of the data that
leads to this conclusion. If there is "fault" it is with the analyst who did not choose a model that
matched the data. If this dispersion statement is made because there are missing covariates,
then, again, it is the analyst and not the data at fault.
I am trying to understand when underdispersion occurs. Since we are sampling, we should
expect a mix of under- and overdispersed data sets.

No, the data are what they are. They have conditional variance. That variance needs to be
matched by your chosen model. Now, it certainly may be true that data from one source may
have conditional variance that is lower than the conditional variance of data from another
source. So, I don't like these statements because they sound like "my model is good, but
sometimes data are bad which leads me to having to change my good model".

The real statement is "I have a standard first model that I always consider. For the most part all
data that I receive are adequately modeled by this first choice. On occasion, the data are
actually over dispersed relative to my model. On occasion, the data are actually under



dispersed relative to my model. When data are not adequately modeled, it is more common
that they are over dispersed relative to my model than that they are under dispersed." OK -
sorry for that aside; it’s really not part of your question. I just wanted to take the time to write
down the information because I think it isn't all that clear in the book.

However, I occasionally see vendors who lose control of their processes and end up screening
their components for shipment to their customers. Generally, component parametrics are well-
modeled using a normal distribution. However, when vendors screen their parts they remove
the tails of the distribution. Would this also be considered an underdispersed situation?

OK - so, this is a really great question because it asks about the subject matter of
over/under/equi-dispersion in a continuous outcome model as opposed to a discrete outcome
model. The answer (for the most part) is that you don't have to worry. At least, you don't have
to worry as much as you do when you have a discrete outcome. The reason is that in
continuous distribution models, the variance has an additional parameter that allows much
more flexibility to estimate the conditional variance. So, the problem just doesn't arise. Now, in
the particular case that you brought up, what if the assumption of a continuous outcome
modeling outcomes on an infinite (or, at least, very wide) range just isn't really true? Shouldn't
we consider truncated versions of those distributions as we do when we consider truncated
distributions for count outcomes in restricted ranges? The short answer is: sure - even if it
doesn't gain us all that much efficiency. There are continuous outcome models based on
truncated distributions. Some specific ones are tobit regression and truncated
regression. While they aren't called over/under dispersion situations, they do explicitly treat the
limited range of the outcomes.

Subject: Grouping Data


Participant’s Question:
In my own work, I can't think of a reason to group my data in the manner described in the book.
Why would an analyst decide to go this route? I can see that it might speed up model
calculations, but otherwise it just seems to complicate matters.
Instructor’s Response:
And the answer is that you would probably NEVER want to create a grouped dataset. Such
things might have been common 30 years ago when we wrote programs on punch cards and
worried about having more observations than would fit into memory. Nowadays, that isn't
such a concern.
I can tell you that I NEVER put data into this format. I sometimes keep it in frequency-weighted
form, but even that is not that commonly done (for me).
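For concreteness, here is a minimal sketch of what the grouped (binomial) form of a model looks like next to the individual-level (Bernoulli) form; the variable names follow the medpar-style examples and are assumptions:

fit_ind <- glm(died ~ white + hmo, family = binomial, data = medpar)     # individual records

medpar$one <- 1
grp <- aggregate(cbind(deaths = died, n = one) ~ white + hmo,
                 data = medpar, FUN = sum)                               # one row per covariate pattern
fit_grp <- glm(cbind(deaths, n - deaths) ~ white + hmo,
               family = binomial, data = grp)                            # grouped (binomial) form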

Subject: Adjusted Model SEs


Participant’s Question:
The first homework problem this week refers to adjusted model SEs. I must admit that I still do
not have a good feel for this adjustment process. Let me try a few questions:

 Can you give me a simple way to understand why the model SEs need to be adjusted?



 Does R automatically perform the adjustment when used with grouped data (i.e. when
it is given the cases command)? If not, what actions do I need to take to force the
adjustment to occur?
 Should it be a standard procedure to evaluate the Pearson dispersion to determine if it
is overdispersed? If R automatically applies the adjustment, it would seem unnecessary.

Instructor’s Response:
Let's list out some facts and then show how these various choices of standard errors relate to
those facts.
(1) If we assume that the outcomes follow a distribution (Bernoulli, Binomial, Poisson, etc), then
the model-based standard errors are a consequence of that distribution. That is, the standard
errors are driven by the variance of the assumed distribution.
(2) If the data really do come from the assumed distribution, then the standard errors from the
distribution will be "correct".
(3) If the conditional variance of the data closely matches the conditional variance prescribed by
the assumed distribution, then the Pearson dispersion statistic will be about 1. Because the
conditional variance of the data is the same as the conditional variance prescribed from the
distribution, the model-based standard errors from the assumed distribution are "correct" - the
right size to ensure that statistics based on those standard errors will have their prescribed
properties.
(4) If the conditional variance of the data is less than the prescribed conditional variance, then
the data are underdispersed relative to the assumed distribution and the Pearson statistic will
be less than 1. Because the conditional variance of the data is smaller than the conditional
variance prescribed from the distribution, the model-based standard errors from the assumed
distribution are too large. If we multiply the model-based standard errors by the square root of the Pearson dispersion statistic, they are probably closer to the correct size (though we don't have any compelling theoretical justification, and we really don't know how small the Pearson statistic should be to compel us to make this adjustment).
(5) If the conditional variance of the data is greater than the prescribed conditional variance,
then the data are overdispersed relative to the assumed distribution and the Pearson statistic
will be greater than 1. Because the conditional variance of the data is larger than the
conditional variance prescribed from the distribution, the model-based standard errors from the
assumed distribution are too small. If we multiply the model-based standard errors by the square root of the Pearson dispersion statistic, they are probably closer to the correct size (though we don't have any compelling theoretical justification, and we really don't know how large the Pearson statistic should be to compel us to make this adjustment).
(6) If we believe that 4 or 5 is the case, then in a very real sense, we are saying that the assumed
distribution is the wrong one. Not for the mean, but for the variance. When we say that, we
acknowledge that the regression coefficient estimates are correct, but the incorrect standard
errors will lead to incorrect inference.
(7) If we decide that there just isn't any way to figure out the correct distribution, we can derive
an alternate standard error that is "distribution-agnostic". The sandwich variance estimate
combines the model-based variance estimator with a non-parametric
variance estimator. Because of this, it more closely resembles the variance of the observed data
when the data do not conform to the assumed distribution.
(8) If we use (7), then in a very real sense, we are saying that the assumed distribution is the
wrong one.



(9) Whether we use robust standard errors (from the sandwich variance estimate) or adjusted
(scaled) standard errors, we are saying that the assumed distribution is wrong, and that we
really don't have all that much confidence in the specified log-likelihood. We are comfortable
that the regression coefficients are adequately modeling the conditional means, but equally
uncomfortable that the model captures the conditional variances; thus, we are using alternate
standard errors. We should NOT use these alternate standard errors and continue to use any
likelihood-based statistics.

OK - now let's answer your questions.

Can you give me a simple way to understand why the model SEs need to be adjusted?

We use standard errors for things like building 95% confidence intervals. 95% confidence
intervals have the property that if experiments are carried out over and over, then 95% of those
intervals will contain the true parameter value. If the standard errors are wrong, then maybe
99% of the intervals will contain the true parameter, or maybe 73% will contain the true
parameter. A standard error is used to create statistics for hypothesis testing. If the standard error is correct, then the test operates at the level of significance we claim. If the standard error is wrong, then the true level of significance might be smaller or larger than the one we claim.
We adjust SEs to ensure that the statistics we compute have the properties that we say.

Does R automatically perform the adjustment when used with grouped data (i.e. when it is given
the cases command)? If not, what actions do I need to take to force the adjustment to occur?

No. That a model is binomial instead of Bernoulli is irrelevant. Data can be over dispersed
relative to the Bernoulli just like they can be over dispersed relative to the binomial. So, there is
no distribution-specific reason to automatically scale. You can decide to scale if the Pearson
statistic is bigger than 2 or smaller than one half (those are common rules of thumb). Or, you
can decide to use robust standard errors. Or, you can decide that the evidence is not compelling
enough to use alternate standard errors.

Should it be a standard procedure to evaluate the Pearson dispersion to determine if it is overdispersed? If R automatically applies the adjustment, it would seem unnecessary.

Perhaps. I am of the opinion that such things should be considered in context. If you want, you
can run your analyses under a rule of thumb that alternate standard errors are always applied when the Pearson statistic is bigger than 2 (for example). In those cases, you must then never rely on any likelihood-based statistics (AIC, LR tests, etc.) because, after all, you have "declared in a sense that the model is wrong". So, you see, applying alternate standard errors can be a knee-jerk way of fixing a problem, and it has a cost. There are no free lunches!
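To make the choices above concrete, here is a sketch in R, where fit is a fitted binomial glm; the sandwich and lmtest packages are assumptions on my part, not something the course requires:

disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)   # Pearson dispersion

fit_scaled <- update(fit, family = quasibinomial)   # model-based SEs scaled by sqrt(dispersion)

library(sandwich)
library(lmtest)
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))     # robust (sandwich) standard errors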

Subject: Cut Points


Participant’s Question:
I have a question about cut points. For the ROC analysis, there is only one cut point (balance
between specificity and sensitivity) which makes sense to me. What can we do if we want to
validate an assessment that has more than one cut point. For example, in my field (criminal



justice) we often use assessments to classify offenders as "low risk", "moderate risk" or "high
risk" to reoffend. These classifications are based on an assessment score (low = 0-2;
moderate = 3-4; high = 5-6). If we wanted to identify the most appropriate cut points but there
are more than two groups how could that be done? So, from the scale above, how could I tell
that low risk should optimally be 0-2 rather than 0-1 or 0-3?
I imagine there are lots of similar examples in the medical field where individuals are classified
as low, moderate or high risk of a disease. My question is how can we identity those cut off
thresholds with multiple groups?
Instructor’s Response:
Your question does indeed make sense! In this course, we are focused (almost exclusively) on a
binary outcome. Our model produces predicted conditional probabilities of Y=1 which of course
we can always subtract from 1 to estimate conditional probabilities of Y=0. In your question,
you are considering an outcome variable with 3 levels (let's start there before we generalize to
k-levels). It turns out that there are many ways to operationalize such a model. Generally, to
make a model identifiable, we have to define a referent level. In the binary case, Y=0 is the
referent level and the coefficients are all part of an attempt to model Y=1. We can do the same
thing for 3 levels where we attempt to estimate P(Y=2) and P(Y=1) and then let P(Y=0) = 1 -
P(Y=2) - P(Y=1). We might want to assume one set of covariates for P(Y=2) and another one for
P(Y=1). Or, maybe we might think that's too hard, and instead we want to use the same set of
covariates for P(Y=2) and P(Y=1). There are lots of other simplifications as well.
In the binary case, we choose a cutpoint (c) where we say if the predicted probability is >= c,
then we will predict Y=1; otherwise, we will predict Y=0. We compare our predicted values to
the observed values. We have to define what the optimal cutoff is. What are we trying to
optimize? The Youden index is equal to sensitivity + specificity - 1. There is a value of the cutoff
that maximizes this index. However, there are other criteria by which we could judge our
choice.
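For the binary case, a sketch of finding that cutoff in R, assuming the pROC package, an observed 0/1 outcome y, and a fitted logistic model fit:

library(pROC)
roc_obj <- roc(y, fitted(fit))
coords(roc_obj, x = "best", best.method = "youden")   # cutoff maximizing sensitivity + specificity - 1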
With three possible outcomes it is more complicated, and it depends on the type of model we
estimate. Let's say we take a general approach in which outcomes Y=1, and Y=2 are compared
to Y=0. The same set of covariates are considered, but we make no restrictions that the
coefficients must be the same in these two comparisons. In that case, we end up estimating a
set of coefficients for Y=2 versus Y=0, and for Y=1 versus Y=0. We could imagine two cutpoints: one for each of these comparisons, and we would still need a third cutpoint to decide between
the better choice of the first two comparisons. The model being described here is called
multinomial logistic regression. If we make the simplification that the outcomes are ordered
and the coefficients have to be the same for the ordered comparisons Y=1 versus Y=0 and Y=2
versus Y=1, then we are describing the ordered logistic regression model. If we restrict some of
the coefficients to be the same, but not all of them, then we are describing a generalized
ordered logistic regression model.
No matter how many outcome levels we consider, the cut points can only be chosen once you
define what property you are trying to optimize with the decisions reached relative to the cut
points.

Now, all of the above discussion on my part is on the definition of cut points as are used to
decide what predicted level is described by a model. Another definition of cut points is in how
best to convert an M-outcome level (or continuous) variable into a K-level variable where
K<M. For example, you might ask people what level of education they have, and get answers
1=1st grade, 2=2nd grade, ..., 12=graduated high school, 13=college, 14=post-graduate. You may
want to turn this 14-level variable into a 3-level variable by defining, let's say, 1-8 (less than high

school), 9-12 (high school), 13-14 (college). Such cut points are chosen either from meaningful
definitions as in my example here, or by convenient mathematical definitions. For example, let's
say we have income, and we summarize that income and identify the quartiles of the variable,
thus allowing us to define 4-levels of equally-represented income. Mathematically balanced for
our analysis, though not necessarily meaningful in any economic, or policy way. For some
measures, there are standard accepted cut points, and for others you are left to your own
devices. Suffice it to say that no matter what your choice, you will run into someone who
disputes what you have done and will want you to do it another way (manuscript referees,
proposal reviewers, and other evil beings). Because there is no right answer, people can
become quite insistent that their definitions are the most sensible.

Subject: Continuous Variables


Participant’s Question:
For some continuous variables, like age, there are still finite levels that would occur in the range
of the study being analyzed - like we have seen with the prior data sets. But there are also some
continuous variables that within the capabilities of the measurement system, can take on a
considerably higher number of distinct values in the analysis and model, e.g. dosage levels or
incomes. For Logistic regression, is it okay to leave these types of variables at their 'finest'
resolution or is it okay to bin these values into groups, while still treating them as a continuous
variable? For example, income may be binned into groups of $1,000 increments; where within
the full range of possible values, this would still appear as a continuous variable.
If it is recommended to group continuous variables into more bins, is there a rule of thumb on
how to do this?
In Logistic Regression, is it a must that we bin the variable and use them to predict or is it ok to
use continuous variables directly and how would the output give us indications on where the
significance is?
Instructor’s Response:
If we leave a small interval for assessing a one-unit change, the coefficient is going to be very small and more difficult to interpret. For age it’s pretty easy: a one-year increase in age gives an increase of X in the log-odds of y, however we define X and y. If our unit is a conceptually much smaller unit, then the coefficient will be small. Sometimes analysts will increase the size of the coefficient, or odds ratio, by increasing the size of the "one-unit" increase; e.g. 2- or 5-year intervals.
I don't know of any rule of thumb. It depends on the situation and how one wants to interpret
the continuous predictor. Here's the problem when dealing with continuous predictors. If a
continuous predictor is in a model, you are assuming that the slope of the predictor is the same
across the entire range of values. This is rarely the case with real data. The predictor's impact on
the model is using a single slope. I always check a continuous predictor using a partial residual plot from a generalized additive model. This allows me to see the changes in slope of the variable, as adjusted by other predictors. If there are major shifts in slope, I factor the predictor at those points. If there are three major changes in slope, I'll cut the predictor at those points and will have a 4-level predictor. When modeled, the levels should have significantly different coefficients (slopes). I can interpret the data better by doing this, even though I lose some data by categorizing the predictor. The fact that it is not in fact characterized by a single slope trumps
the loss of data information contained in the predictor.
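A sketch of that GAM check in R, assuming the mgcv package and the medpar-style variables (los stands in for the continuous predictor of interest):

library(mgcv)
g <- gam(died ~ white + hmo + s(los), family = binomial, data = medpar)
plot(g, select = 1, shade = TRUE)   # marked changes in slope suggest where to cut the predictor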



Subject: Odds Ratio, Odds and Probability
Participant’s Question:
I am still slightly confused about the relation of odds, odds ratio and probability. I was following
along fine up to the point where a probability was computed from an odds ratio. I understand
that the odds ratio of white=1 is
OR = exp(.7703 + (-.0477*x))
Where x is varying levels of los.
So the first entry in the table with los=1, says that for los=1 the odds ratio of white to nonwhite
is 2.06 or that the odds of white=1 death is about 2 times that for nonwhite at los=1.
What I am having difficulty understanding is what the P that is computed in formula 6.10 and the table at the bottom of the page means.
It isn't the probability of died given white=1 and los is in the set (1, 10, 20,..., 100) because we'd
need to also include the constant in the computation (essentially we'd need to compute
exp(XB)). Or in the case of white=1 and los=1, the odds of death using the coefficients on pg
213:

exp(.7703072*1+.0100189*1+(-.0476866)*1*1+(-1.048343))
= 0.7292756
so the probability of death given white=1 and los=1 is
= Odds/(1+Odds)
= 0.7292756/(1+0.7292756)
= 0.4217232
So I guess my question is how to interpret the probabilities at the bottom of page 214.

Instructor’s Response:
Usually a researcher is only interested in calculating odds ratios for coefficients, where an odds
ratio is the ratio of the odds of one level of the predictor compared to the odds of the reference
level of the predictor. Therefore, if die is the response and gender (1-male;0-female) is the
binary (1/0) risk factor, or explanatory predictor, and the odds ratio given for gender is 2, then the odds of death for males compared to females is 2, or males have twice the odds of death of females, etc. Odds ratios in this respect deal with the model as a whole, and represent the
mean for all observations.
The second statistic that is normally of interest to researchers is the probability of y==1. For our
example above, we would be interested in the probability of death for each individual in the
data. Whereas parameter estimate odds ratios are with respect to the model as a whole (the
mean for all individuals) the probability is with respect to each individual. Actually, probabilities
are calculated for covariate patterns, since they consist of identical values. We convert the linear
predictor, xb, into the probability of y==1 by calculating for each individual, 1/(1+exp(-xb)) or
exp(xb)/(1+exp(xb)).
However, we can also convert the individual probabilities to the odds of an individual dying, for
example, by using the formula given in the book. This is rarely used in research, but it has been.
Given our knowledge that the odds of x are defined as p/(1-p), where p (probability) is from 0 to
1, the odds of x==1 is calculated as

odds1 = p1/(1-p1)
and the odds of x==0 is
odds0=p0/(1-p0)



These are odds for each individual. Now take the mean value of each odds1 and odds0 for all
individuals in your data and divide odds1/odds0 and you'll get the model odds ratio for the
predictor.
OR = mean(odds1)/mean(odds0)
Let's try it for our example. Recall that both died and white are binary.

. logit died white, nolog


Logistic regression                           Number of obs =   1495
                                              LR chi2(1)    =   2.26
                                              Prob > chi2   = 0.1331
Log likelihood = -960.30439                   Pseudo R2     = 0.0012
------------------------------------------------------------------------------
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       white |   .3025126   .2049038     1.48   0.140    -.0990914    .7041167
       _cons |  -.9273406   .1968928    -4.71   0.000    -1.313243   -.5414377
------------------------------------------------------------------------------

Obtain the odds ratio of white

. logit died white, nolog or
<header stats the same as above>
------------------------------------------------------------------------------
        died | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       white |   1.353255    .277287     1.48   0.140     .9056599     2.02206
       _cons |   .3956044   .0778917    -4.71   0.000     .2689463     .581911
------------------------------------------------------------------------------
Get the predicted probability for each white==1 and white==0. This is the same as, for example
for white==1 and white==0 respectively

. di 1/(1+exp(-(_b[white]*1 + _b[_cons])))    /* or 1/(1+exp(-(.3025126-.9273406))) */
.34868421

. di 1/(1+exp(-(_b[white]*0 + _b[_cons])))    /* or 1/(1+exp(-(-.9273406))) */
.28346457
The same values are created by Stata using the code:
. predict mu1 if white==1
(option pr assumed; Pr(died))
(127 missing values generated)
. predict mu0 if white==0
(option pr assumed; Pr(died))
(1368 missing values generated)

Now let's convert each level of probability to an odds.


. gen odds1 = mu1/(1-mu1)
(127 missing values generated)
. gen odds0 = mu0/(1-mu0)
(1368 missing values generated)



. su odds*
    Variable |  Obs        Mean    Std. Dev.        Min        Max
-------------+------------------------------------------------------
       odds1 | 1368    .5353535           0    .5353535   .5353535
       odds0 |  127    .3956044           0    .3956044   .3956044
And divide odds1 by odds0 and we get the model OR on white
. di .53535353/.3956044
1.3532547

We get the same 1.353255 value for the odds ratio as displayed in the model above. You may
convert back to the predicted probabilities for each individual by using the following code.
prob1=mu1 and prob0=mu0.
. gen prob1=odds1/(1+odds1)
. gen prob0 = odds0/(1+odds0)

I hope that this clarifies what is happening. I don't believe you'll find this breakdown of the
relationship of odds, OR, probability, and observation odds elsewhere, but it does all fit together remarkably well.
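The same walk-through can be sketched in R, assuming the medpar data with died and white as above:

m  <- glm(died ~ white, family = binomial, data = medpar)
p1 <- plogis(coef(m)[1] + coef(m)[2])    # Pr(died = 1 | white = 1)
p0 <- plogis(coef(m)[1])                 # Pr(died = 1 | white = 0)
(p1 / (1 - p1)) / (p0 / (1 - p0))        # equals exp(coef(m)[2]), the odds ratio for white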

Subject: Some useful sites


Instructor’s Comment:
I would like to advise you all of an excellent web site on which you can obtain a great deal of top
level statistical information. The site is free, and its address:
www.statprob.com
It is the relatively new online encyclopedia sponsored by the foremost statistical and probability
societies worldwide. Members from each society (e.g. American Statistical Association, Royal
Statistical Society, International Statistical Institute, Institute of Mathematical Statistics, Chinese
Statistical Society, etc.) review and approve articles in various areas within statistics. It started with the organizing committee asking 100 authors of articles in the 3-volume International Encyclopedia of Statistical Science (2010, Springer), or IESS, whether they would consent to have their encyclopedia articles form the initial block of StatProb articles. There were a number of articles
about statisticians from history already in it, and one substantial article. I don't recall what it was
any more. But that was it until about February of this year. Now there are quite a few articles,
and growing.
Anyhow, if you want to know about a particular area of statistics, check out the encyclopedia,
which appears as Wikipedia style pages. Items in the articles are cross referenced to other
articles, making it a great educational tool. It is intended that StatProb will be around
permanently - insofar as that is possible.
I have 3 articles in StatProb, which came from the IESS encyclopedia: Logistic Regression,
Modeling Count Data, and Generalized Linear Models. I invite you to check out the logistic
regression especially. But there are many others which may interest you as well.
Again: www.statprob.com
then click on articles, or just look around and select what you want.
Also, many of you might be interested in purchasing texts on various statistical topics. The Stata bookstore has over 100 books on statistics, most of which do not involve Stata but are generally the top books in the field. The nice thing is that the prices are almost always better than Amazon or other sites, and the books can be shipped anywhere. Go to
www.stata.com/bookstore/
to get folders for "Books on Statistics", "Books on Stata", and books published by Stata Press. I suggest clicking on the first of these. Many entries have ancillary information about the data sets and code associated with the book, and a brief description of the book. The books come from all of the major publishers of statistics books, and the prices are generally very good. The only books they do not sell are those that are about other software. No, I don't get a kickback on sales; I just want to pass the better prices on to you. Some statistics books are very expensive.

Subject: Model fit, deviance and number of observations


Participant’s Question:
In section 7.1.2 (reference: Logistic Regression Models, Joseph M. Hilbe, 2009), where the book talks about the deviance statistic, looking at the table at the bottom of p. 247 and trying to understand the results, it looks like the models were fit with different numbers of observations.
So the constant-only model was fit with all the data, 5388 observations, but due to NAs the model with anterior was fit with 4696 observations. I'm wondering if the Chi2 test will work with different numbers of observations, or whether we need to use the same set of observations for all the models for the Chi2 test to make sense here?
Instructor’s Response:
Great question. This method was first proposed by McCullagh and Nelder in their seminal book on the subject, Generalized Linear Models (1983, 1989), Chapman & Hall. It has been used by others as well, but not much anymore. Basically, the method progresses from an intercept-only model by adding more predictors and interactions, checking whether the mean deviance difference from adding a predictor is comparatively low. The example in the book shows that adding hcabg is likely not significant. Likewise for the interactions.
I deleted all observations with missing values and came up with the following table of main
effects

MODEL          DEVIANCE    DIFFERENCE    DF    MEAN_DIFFERENCE
=============================================================
intercept      1486.290

MAIN EFFECTS
anterior       1457.719        28.500     1             28.500
hcabg          1453.595         4.124     1              4.124
killip         1372.575        81.020     3             27.001
agegrp         1273.652        98.923     3             32.974
=============================================================
The values are different, but the results are clearly the same. If you would like to complete it for
the interaction of anterior and killip levels and hcabg and age levels, it would be interesting to
see the results. I strongly suspect that it will be the same.
Your question had to do with differences in observations between some of the component
models. I looked up the original M&N source, and some recent sources that show this method,
and no one mentions differences in observations due to missing values. However, I think you
make a valid point.
Participant’s Comment:
I've been playing around a little more, and in R, the function anova() given a single model as an
argument generates a similar table.



> heart01.all.glm <- glm(death ~ anterior + hcabg + killip + agegrp +
    anterior:killip + hcabg:agegrp, data = heart01, family = binomial)
> anova(heart01.all.glm, test = "Chisq")

                 Df  Deviance  Resid. Df  Resid. Dev   P(>|Chi|)
NULL                                4502      1486.2
anterior          1    28.500       4501      1457.7    9.37e-08 ***
hcabg             1     4.124       4500      1453.6    0.042289 *
killip            3    81.020       4497      1372.6   < 2.2e-16 ***
agegrp            3    98.923       4494      1273.7   < 2.2e-16 ***
anterior:killip   3    12.750       4491      1260.9    0.005209 **
hcabg:agegrp      3     1.110       4488      1259.8    0.774759

So these values seem essentially identical to your table. With respect to the book, hcabg is now
significant, the Df for the interaction terms are 3 instead of 4, and it looks like the interaction
term of anterior & killip is significant here (it certainly has a noticeably larger Deviance
difference than the value in the book).
What got me thinking along these lines was an analogy to likelihood ratio tests where the
models have to be nested and use the same set of observations.
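If you want the nested comparisons to use exactly the same observations, one simple approach is to restrict to complete cases before fitting any of the models. A sketch in R, using the variable names from this thread (heart01, death, anterior, hcabg, killip, agegrp); adjust the variable list to your own data:

# Sketch: sequential deviance comparisons fit to one fixed set of complete cases,
# so every nested model uses the same observations.
vars <- c("death", "anterior", "hcabg", "killip", "agegrp")
cc   <- na.omit(heart01[, vars])

m0 <- glm(death ~ 1, family = binomial, data = cc)
m1 <- update(m0, . ~ . + anterior)
m2 <- update(m1, . ~ . + hcabg)
m3 <- update(m2, . ~ . + killip)
m4 <- update(m3, . ~ . + agegrp)

# Likelihood-ratio (deviance) tests between successive nested models:
anova(m0, m1, m2, m3, m4, test = "Chisq")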

Subject: Missing value questions


Participant’s Question:
If there is a variable that theory/physiology etc. suggest could be important but there are a large
number of missing values the size of the final model will be much smaller than it could have
been. An example could be if we have a large dataset with 5,000 cases and there is a potentially
important variable with 2000 missing values. The model we develop retains this variable and
there are 1750 cases in the "final" model. Alternatively we develop a model excluding the
variable with all the missing values (regardless of its potentially important value) and our final
model now has 4000 cases. The AIC/BIC's of the much smaller model are less. Is this a case when
we should use an AIC/BIC adjusted for "n"? If we use the "mark out" etc. sequence in Stata we
are reducing the models so we can compare by LRT, but it is using a reduced dataset. Are we
better off forsaking the benefits of study size for model fit? With so many missing values, there
must be a risk of bias. How can we be sure that if there were no missing values and the variable
were assessed over 5000 cases that it would still be significant? Can list wise deletion lead to
bias in this situation? 2. From the example in the book (section 7:4:1.5) and a dataset I was
playing with the Standardized Deviance Residual seems to identify a lot of outliers in the
outcome==1 group (using the >4 cutoff). Is this typical? Is it suggesting that we should be doing
something with these variables (as when other methods identify significant outliers), or is this
test just telling us that the model or some covariate patterns are not predicting outcome==1 as
well as we would like? 3. At what point in data exploration/analysis do you recommend
considering removing extreme values if you have no way of checking the accuracy of the data?
Instructor’s Response:
Missing values can indeed be a problem with real data. If you know what the cause of the missing values is, you can conceivably treat "missing" as a separate level. If so, recode the missing values to a number, e.g., 2, and make the variable categorical. The level for missing values can perhaps tell us something of importance; you may even want to make it the reference level. I suggest this possibility when there are lots of missing values in a variable.



If the variable is continuous, then the above approach is not directly available. You may want to convert the continuous variable to a categorical variable and then take the above approach. If no meaning can be assigned to the missing values (e.g., they are MAR), then you may have to drop the variable as a predictor, or accept the substantially smaller model. You are correct about the value of using AIC/n, which I like as the main AIC statistic. Note that the AIC and BIC statistics that have been designed to improve on the older versions ALL have n as a divisor, or at least as an adjuster. As for listwise deletion and bias - I think it depends on the data.
There are different opinions about this in the literature. If there is no way to know if the
extreme values are real, or are mistakes, and there are a few very extreme values that result in a
change in the parameter estimates, etc., then many statisticians will drop them and model
without them, but making note of what was done in the research report. That's important.
Remember, a model is just that - a model. It tries to capture the essence, so to speak, of the
data, and is not an exact replication of the data. It is purposefully meant to be nebulous or fuzzy
edged.
You use >4 for squared standardized deviance residuals. For logistic models, a plot of these against the fit usually gives you a V or U shape. Values in the cusp are influential, but those on either horn that are over 4 are outliers and should be treated as such. Again, what you do with them depends on the data and what you know about them. Usually it is wise to look at the profile of the primary outliers to see if there is something mutually distinctive about them. Sort the data by the squared residual and list the response, squared residual, and covariates for the highest values. There are a number of considerations when deciding how to handle outliers for a particular model. The foremost point is to determine whether it's an otherwise well fitted model. If it's a horrible model, then residuals are the least of your worries.
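A small R sketch of this outlier check, assuming `fit` is a fitted binary logistic glm() object (the >4 cutoff and the sort-and-list step follow the advice above):

# Sketch: flag observations with large squared standardized deviance residuals.
r2 <- rstandard(fit, type = "deviance")^2   # squared standardized deviance residuals

# Plot against the fitted probabilities (mu); look for points above 4.
plot(fitted(fit), r2, xlab = "mu (fitted probability)",
     ylab = "squared std. deviance residual")
abline(h = 4, lty = 2)

# Sort and list the worst-fitting observations together with their covariates.
worst <- order(r2, decreasing = TRUE)[1:10]
cbind(model.frame(fit)[worst, ], sq_resid = round(r2[worst], 2))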

Subject: Residual analysis


Participant’s Question:
The book talks in section 7.4 (reference: Logistic Regression Models, Joseph M. Hilbe, 2009) about the importance of residual analysis for model fit. My question is how to choose
between the different residuals? Do you check all of them or are there some that are more
commonly used? Also, can you recommend (or post) one of your research articles, where you
did a logistic regression and discussed the model fit? I think it would help me to get a better idea
of how to analyze the model and discuss it. I would greatly appreciate it.
Instructor’s Response:
I consider the squared standardized deviance residual versus the fit (mu) to be the best overall residual test for a logistic model. I also like to weight the residuals by the delta-beta statistic. See pages 281-284.
Observing the effect of a variable on the distribution of a categorical variable with two or more levels over time is also very informative, and can be captured using a conditional effects plot. I give the code on how to set it up in the book. Residual analysis is, of course, best if at least one variable in the graph is continuous. If your data are mostly binary or categorical, then I rarely use residual graphs.
The key point is to find a graph for your data that helps bring out information about the data that cannot (easily, at best) be obtained from a single statistic. Conditional effects plots are great, and are commonly used in journals. Also desirable are residual plots to see whether there are influential covariate patterns, or residuals that help identify true outliers. These are discussed in the text.



I wrote a 50 or so page handout for a workshop on logistic regression some time back. I focused
on residuals of every sort - what they do, and how to construct and interpret them, and so forth.
If I can find it I'll pass it along to the class, or post it to a new site I have for download. BePress
kindly gave me the site so that I could post articles, errata for books, e-books, or whatever I
wanted. I put some articles and papers on it in February, but never advertised it. I received my
first report this morning on its usage, and was totally surprised by the number of downloads. I
will add to it as I find the time, but there might be something that will interest you. The web
address is: http://works.bepress.com/joseph_hilbe/
If I find the handout, I'll scan it and post it to that site. Otherwise, I really don't know of a place where you can find more graphics for logistic regression models. If you are interested in Stata graphics, I recommend Michael Mitchell's "A Visual Guide to Stata Graphics", Stata Press. The Stata bookstore has well over 100 books on statistics, many of which have nothing to do with Stata at all. You can check out "Books on Statistics" for a very good listing of top statistics books, or "Books on Stata" for texts that use Stata for statistical analysis. See www.stata.com/bookstore/.

Subject: quasibinomial
Participant’s Question:
I noticed that when using the 'quasibinomial' family in the R glm function, the results do not produce an AIC value. What is the best way to compare models built using the 'binomial' vs. 'quasibinomial' families?
I am also not confident about how to confirm whether there is an issue with overdispersion. Can you clarify how to confirm whether overdispersion is a problem in the model being analyzed?
Instructor’s Response:
The reason is that scaled SEs are not directly derived from the model variance-covariance matrix. Remember, SEs are the square roots of the diagonal terms of the V-C matrix. This is easy to see when using R to determine the SEs of a model. If I ran a logistic regression in R with the glm function, naming the model "mymodel", and wanted to create a vector of predictor SEs to use in some other operation, I would need the following code: se <- sqrt(diag(vcov(mymodel))). Anyhow, the V-C matrix is based on the model log-likelihood statistic. But scaled SEs are not calculated that way: yes, we start with the model SEs, but we then multiply them by the square root of the Pearson Chi2 dispersion statistic. This means that you cannot use the log-likelihood to compare the models. And since the AIC and BIC statistics are based on log-likelihoods, they are not applicable for scaled (or quasi-) models.
To assess whether a logistic model or a scaled logistic model (quasibinomial) is preferred, run the two models and check the values of the SEs. If there has been a change -- especially a substantial change -- in their values, then the scaled or quasibinomial model is preferred; it has adjusted the SEs for the extra correlation in the data. Technically, scaling attempts to provide the SE values that would be the case if there were no extra correlation in the data. The coefficients (or odds ratios) are unchanged between the two models. What I have said here also applies to overdispersion, or extra correlation, in grouped logistic models. The best way to adjust for overdispersion in grouped models (indicated by a dispersion statistic greater than 1.0) is to run a beta binomial model. Check the AIC and BIC statistics of both models and choose the model with the lower AIC/BIC value. For Bernoulli models I suggest that you employ robust or sandwich SEs when modeling the data. Compare those SEs with a logistic model where the model SEs are left unadjusted. I suggest using sandwich SEs in preference to a quasibinomial model. Then check the unadjusted model for goodness of fit by using ROC analysis, a specificity-sensitivity plot, a classification plot, and a Hosmer-Lemeshow test using 8, 10 and 12 groups, making sure that all three give the same results.
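Here is a hedged R sketch of the scaling relationship described above for a grouped (binomial) model; dat, y (successes), n (trials), and x are placeholder names. The hand-scaled SEs should agree with what family = quasibinomial reports, while the coefficients are identical:

# Sketch: compare model-based and quasibinomial (scaled) SEs for a grouped logistic model.
fit  <- glm(cbind(y, n - y) ~ x, family = binomial,      data = dat)
qfit <- glm(cbind(y, n - y) ~ x, family = quasibinomial, data = dat)

# Pearson Chi2 dispersion statistic: Pearson Chi2 / residual df.
disp <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual

# Quasibinomial SEs are the model SEs multiplied by sqrt(dispersion);
# the coefficients themselves are the same in both fits.
cbind(model_se  = sqrt(diag(vcov(fit))),
      scaled_se = sqrt(diag(vcov(fit))) * sqrt(disp),
      quasi_se  = sqrt(diag(vcov(qfit))))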

Subject: Log-Likelihood
Participant’s Question:
1. I need to understand what the log-likelihood is. I've always seen this number in the analysis output, but what does it actually mean? What is its value telling us, and how do we interpret it?
2. Let's say we have a continuous predictor variable. Over its range of values, how can we group it into categories? What methods facilitate this, and is there any rule of thumb?
3. All the examples and analysis procedures in this course are presented in Stata with an epidemiological focus. My background is in engineering, and Minitab is the software I am using for this course, so I'm having a hard time following some concepts. Just curious whether you know of any resources showing logistic regression applied in engineering with Minitab; that would probably help me understand more.
I appreciate your info.
Instructor’s Response:
1) See pages 63-65 (reference: Logistic Regression Models, Joseph M. Hilbe, 2009). All GLM-based models, and all parametric statistical models in general, are based on an underlying PDF (probability distribution function). The likelihood is simply a re-parameterization of the PDF such that, given a set of fixed observations, the function is used to determine the most likely parameters that produce those observations. In other words, the likelihood function solves for the parameters that make the observations, or response term (y), most likely. Recall that a PDF starts with given parameters and calculates the probability of the observations; the likelihood function works in the opposite direction.
Like the PDF, the observations in the likelihood function are assumed to be independent, so the likelihood is the product of the PDF terms, one per observation. Numerically, however, it is much easier to deal with sums of terms rather than products. The natural log of the likelihood function is called the log-likelihood function, and it is what maximum likelihood algorithms actually work with.
Bayesian analysis, however, uses the likelihood, multiplying it by a prior probability distribution
to obtain a posterior distribution. The mean value of the posterior distribution of an individual
predictor is the Bayesian coefficient.
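As a quick illustration of the log-likelihood for a binary logistic model, the following R sketch computes the Bernoulli log-likelihood by hand and compares it with logLik(); y, x, and dat are placeholder names:

# Sketch: Bernoulli log-likelihood of a fitted logistic model, computed by hand.
fit <- glm(y ~ x, family = binomial, data = dat)

yy <- model.response(model.frame(fit))   # the 0/1 outcome actually used in the fit
p  <- fitted(fit)                        # predicted probabilities mu_i

ll <- sum(yy * log(p) + (1 - yy) * log(1 - p))  # sum of log Bernoulli probabilities
ll
logLik(fit)                              # should match, up to rounding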
2) Continuous predictors are usually categorized or factored into groups depending on the nature of the predictor's distribution. There is no hard and fast rule, though. If a continuous predictor has a smooth range of values, one will usually divide it into 3 to 5 equally sized groups. Or, if the variable is not smooth and seems to have natural breaks in its range of values, it may be wise to categorize based on those breaks. There are a variety of categorizing strategies; which you employ depends on the data and how you're looking at it.
3) You can likely find many applications in engineering that can be modeled using logistic regression. In fact, I know of several books on generalized linear models written by engineering professors for their students. Doug Montgomery at Arizona State (where I am also located) is probably the most famous of these. The author of SAS's GENMOD procedure, which estimates SAS's GLMs, GEEs, and all GLM-based Bayesian models, has his PhD in engineering. If your response term is binary, a logit or probit model will very likely be the most appropriate model to use.
With respect to Minitab: I understand that it is easy to use, which makes it attractive in that respect, but when they sent me a copy of the package for review, it seemed to have few capabilities in the areas of statistical analysis that I thought were important. I know that some people really like it, but I have not used it beyond that review, so I can't give you advice. Perhaps someone else in the course has some experience with it. It is used as a teaching tool in a number of colleges and universities.

Subject: P-value of the intercept


Participant’s Question:
Just out of curiosity: in a logistic regression model, what happens when the only p-value lower than 0.05 is the intercept's? The p-values of all the other predictors are higher than 0.05.
Instructor’s Response:
It appears that you may not have a well fitted model. Before throwing it out, though, see whether there is extra correlation in the data that is affecting the SEs. Try scaling the SEs. For instance, in Stata, model as

. glm y xvars, fam(bin) nolog scale(x2)


Or try a robust or sandwich estimator
. glm y xvars, fam(bin) nolog vce(robust)
If the SEs and p-values do not change much, then there is little extra correlation and the model SEs are fine. If they are different, then you need to identify and adjust for the extra dispersion. If you do and the p-values are still > 0.05, then try removing various predictors, try some interactions, and so on. If it still does not work, try modeling with a cloglog or loglog regression. Or a probit - but typically logit and probit differ by only a little (their coefficients differ by roughly a constant factor). If that still does not work, you can try nonparametric methods, which we do not discuss in this course. Flexible polynomial logistic regression might be a strategy. If it is a small model you can try exact logistic regression or Monte Carlo logistic regression, which is based on simulation rather than maximum likelihood. Stata and R can do all of these methods, but they are the only packages that can. I discuss exact logistic regression and Monte Carlo in the last chapter of the book. Flexible polynomial LR is a fairly new method, with the author first coding it in Stata. I'll discuss these methods in the Advanced Logistic Regression course, which starts after this one has concluded.

Subject: Binary Logistic Model


Participant’s Question:
I need further confirmation that I have understood correctly.
For a binary logistic model, we will never have overdispersion unless there is correlation in the data (a cluster effect in this case)? Then we can use scaling for the SEs, or another method discussed in the textbook that I missed?
A binomial logistic model is for grouped or proportional data and always has an overdispersion problem, so it also requires scaling of the SEs. If the two approaches are used for different kinds of data, why do we need to convert binary data to grouped or proportional data for the heart01 data on page 302?
Can you also please show how to scale SEs with the Wald statistic?
Instructor’s Response:



Binary response data structurally cannot be overdispersed. If there appears to be extra correlation in the data, i.e., observations are clustered or longitudinal, then you need to adjust by scaling, robust SEs, bootstrapping, etc.
Binomial or grouped models can be extradispersed - they don't have to be. For a grouped logit model, if the Pearson dispersion is quite a bit greater than 1, then scale, use robust SEs, bootstrap, etc. The notion of implicitly overdispersed binary data means that if you group a binary response model's data and the grouped model is overdispersed, then the binary model had that extra correlation within it. But it still is not overdispersed in the standard sense.
You don't need to convert an observation-based data set to a grouped one. But let's say that the data are clustered and you feel that an adjustment of some sort should be made --- that there is correlation in the data. If you can, group the data and see whether it is indeed over- or underdispersed, and make the appropriate adjustments.
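A small R sketch of that grouping step: collapse the observation-level data to one row per covariate pattern and check the Pearson dispersion of the grouped fit. The names dat, y, x1, and x2 are placeholders:

# Sketch: group a binary-response data set by covariate pattern and check dispersion.
dat$one <- 1
grp <- aggregate(cbind(deaths = y, n = one) ~ x1 + x2, data = dat, FUN = sum)

gfit <- glm(cbind(deaths, n - deaths) ~ x1 + x2, family = binomial, data = grp)

# Pearson dispersion: values well above 1 suggest extra (over)dispersion.
sum(residuals(gfit, type = "pearson")^2) / gfit$df.residual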

Subject: Linearity of Slopes Test


Participant’s Question:
On pp. 88-89 (reference: Logistic Regression Models, Joseph M. Hilbe, 2009) of the text, you put some emphasis on the fact that the age groupings each span 20 years, stating this several times and going to the effort of regrouping ages such that this is true. At the top of p. 89, you say that you "see a clear linear trend in the age groups" and "the results indicate that the variances [I think you mean this to be "variations"] in the smooth are largely artefacts of the data..."
Just to be sure -- what you are looking for here is to:
1) make the age groups all span the same number of years (here, 20); and
2) check the coefficients from the resulting logit regression to be sure that the increase from one category to another is the same throughout the categories.
Here, for example, the increase from baseline to age23 is 1.13 and the increase from age23 to age4 is 2.19 - 1.13, or 1.06, and 1.06 is approximately equal to 1.13? If there had been an age5 category, you would have expected it to increase from 2.19 by about 1.1 as well (i.e., to about 3.3)? And so on? But for this to be a valid test, one needs to ensure the categorical variables all span the same number of years? And the final conclusion -- if all checks out -- is that we can model the variable as continuous? So what we are doing here is breaking the continuous variable down into categories to check that it is linear in the logit, and therefore can appropriately be modeled as continuous?
Instructor’s Response:
A good logistic continuous predictor will have a single slope anywhere along its range of values. We can see steadily increasing slopes in the age variable when dividing age into 20-year age groups; the underlying continuous line would have a rising slope. We had to combine two levels since the slopes 0.5 and 1.52 were not statistically different from one another. But 1.52 and 2.2 (2.1958) are statistically separate, so those slopes really are different; the lower levels must be combined. And what do we get when we combine the lower levels --- a slope of 1.25. This is not a surprise. The slopes of 0 (reference), 1.25, and 2.2 are all statistically different, and indicate a true increase in the odds of death for increasing age levels.
I hope that answers your question. I chose 20-year groups since it seemed clinically reasonable to me. I usually don't just let the computer decide for me.



WEEK 4

Subject: Deciding Variables in Model


Participant’s Question:
When you are deciding which variables to include in your best model, are you mainly using a p-
value cutoff to decide which ones to use? I have seen people use a cutoff where they include
every variable with a cutoff of p<0.25, for example. Also, are those p-values generated by
running all the variables together or running each variable separately?
Instructor’s Response:
The most common approach is to run a separate, unadjusted analysis of the outcome on each candidate predictor. All predictors that have an unadjusted p-value less than some cutoff are put into a "full" model. Usually, the cutoff is somewhat more liberal than 0.05; for example, many researchers use 0.10 or even higher. Once you have the complete list, a full model is estimated. Sometimes people then cull further from this model, eliminating those variables that have p-values greater than 0.20 or an even lower threshold (but not lower than 0.05). There is no single correct approach to how this is done, but one tries not to be too haphazard about it. Some people treat all covariates as candidates for inclusion/deletion, and others have a set of variables that are always included in the analyses no matter what (like variables that are to be tested as part of the specific aims of a research project).
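A minimal R sketch of that unadjusted screening step; the outcome y, the candidate names, and the 0.10 cutoff are placeholders chosen to match the discussion above:

# Sketch: univariable (unadjusted) screening of candidate predictors.
cands <- c("x1", "x2", "x3")   # placeholder candidate predictor names

pvals <- sapply(cands, function(v) {
  f   <- reformulate(v, response = "y")
  fit <- glm(f, family = binomial, data = dat)
  anova(fit, test = "Chisq")[v, "Pr(>Chi)"]   # likelihood-ratio p-value for the predictor
})

keep <- names(pvals)[pvals < 0.10]   # liberal screening cutoff
keep                                 # candidates to carry into the "full" model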

Subject: Chi2 goodness fit using beta-binomial


Participant’s Post:
I tried to run the Pearson Chi2 goodness of fit on some data using the beta-binomial family. I get the following error:
'arg' should be one of "simple", "weighted", "partial"
I do not get this error with the binomial or quasibinomial families. Is there a way to run the Pearson Chi2 with a beta-binomial model?
Also, the book uses sigma as a dispersion measure for the beta-binomial model. Exactly what is sigma, and what constitutes a good sigma value?
Another Participant’s Post:
My understanding, which is probably wrong and needs to be clarified by Dr. Hardin, is that we use the Pearson Chi2 GOF for the binomial and quasibinomial models. If we run a beta-binomial model, then we would look for a sigma with p < 0.05. If sigma in the beta-binomial model has p < 0.05, then the beta-binomial would be the preferred model.
That's how I understand it, but I would also like guidance.
I'm wondering if you came to this question from the exercise? I was trying to work through final project OPTION 2. I can't find a good model that fits the data. I tried to fit binomial, quasibinomial and beta-binomial models ... no GOF on any, so I'm a little stumped.
Update: actually, I just ran a Poisson model on OPTION 2 and I think it fits (according to the Pearson Chi2 it does, but I'm not sure if I am supposed to use the Pearson Chi2 to assess Poisson GOF).
Instructor’s Response:
Let me answer the sigma question first. The beta-binomial and binomial have the following
properties:
                 Binomial       Beta-binomial
Mean             mu             mu
Variance         mu(1-mu)       mu(1-mu)[(1+n*sigma)/(1+sigma)]



So, the beta-binomial has a conditional variance that is (1+n*sigma)/(1+sigma) times as large as
the conditional variance of the binomial. If you estimate a binomial model and see that there is
evidence that the data are over dispersed relative to the binomial, you can then estimate a
beta-binomial model. When you estimate a beta-binomial model, you can assess whether the
data would be estimated just as well using the binomial by assessing a test of H0:sigma=0. If
that test is not rejected, then the conditional variance of the beta-binomial is no different than
the conditional variance of the binomial and you may as well use the simpler model (the
binomial). If that test is significant, then you have justified that the conditional variance is
significantly greater than that allowed for by the binomial.
Thus, to validate using the beta-binomial, all you need is a significant test of whether sigma=0;
any value of sigma that is significant is a "good value" in the sense of the question you posed.
Can you show me what commands you issued in R? I am not sure how to track down the error
message that you got, so I would like to replicate it myself. Once I see it, I can advise on how to
get around it.
Participant’s Reply:
First, I got this error using the P__disp command and the following command sequence from the
book.
ha = gamlss(heartatk ~ factor(sex) + age + tgresult, family = BB, data = na.omit(lr_model))
pr = sum(residuals(ha, type = "pearson")^2)
df = ha$df.residual
p_value = pchisq(pr, ha$df.residual, lower = F)
print(matrix(c("Pearson Chi GOF", "Chi2", "df", "p-value",
               " ", round(pr, 4), df, round(p_value, 4)), ncol = 2))
Second, to evaluate the significance of sigma, I would need a standard deviation or variance -
unless of course the variance of sigma is 1.
Is the variance of sigma =1?
Or is this significance in the output of the summary command and I just don't know how to read
it?
Instructor’s Response:
The error message is with regard to the available residuals from the object returned by
gamlss(). You would have to create those residuals yourself since there is no support for
Pearson type residuals. The definition of Pearson residuals is:
r = (y - muhat) / sqrt(v(muhat))

v(muhat) for the beta-binomial is muhat*(1 - muhat/n)*(1 + n*sigmahat)/(1 + sigmahat)
where muhat is the estimated mean value (using the link function). That said, I am not sure
what is being estimated here given that the outcomes are all 0/1. When that is the case, the
variance of the beta-binomial is exactly the same as the variance of the binomial, so you are
better off just estimating a logistic regression.
The variance of sigma is part of the overall variance covariance matrix. You might be able to use
the vcov() function to get that matrix, or you can just directly look at the standard error when
using summary(). If you don't use summary() the amount of information could vary and the
standard error of the sigma estimate might not be printed. Rest assured it is calculated and
hiding somewhere - you just have to work to find it!
Anyway, try the vcov() function on the returned object, or try the summary() function on that object. If neither works, look at the names of the returned object and look through those components. Also, check the documentation for the package that you downloaded that provides the gamlss() function.
In the above, "returned object" is the one returned from gamlss().
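As a hedged sketch, the Pearson-type residuals could be assembled by hand along the lines below. It is a direct transcription of the residual formula quoted above; ha is the gamlss(..., family = BB) object from this thread, while y (observed counts) and n (binomial denominator) follow the thread's notation and must be supplied to match how the model was set up:

# Sketch: hand-made Pearson-type residuals for a gamlss beta-binomial fit,
# using the variance function quoted above. `y` and `n` are assumed to be defined by you.
library(gamlss)

muhat    <- fitted(ha, what = "mu")      # fitted mean
sigmahat <- fitted(ha, what = "sigma")   # fitted dispersion parameter

v <- muhat * (1 - muhat / n) * (1 + n * sigmahat) / (1 + sigmahat)  # v(muhat) as quoted above
r <- (y - muhat) / sqrt(v)                                          # Pearson-type residual

sum(r^2)   # Pearson Chi2 statistic, to compare against the residual df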

Subject: Dropping a Significant Factor


Participant’s Post:
When using the drop1 command, I have seen some data sets where certain factors are clearly significant but do not seem to materially affect the AIC value. When you drop a factor from a model, is there any sort of guideline to use when dropping a significant factor that does not seem to materially affect the overall "goodness" of the fit?
Instructor’s Response:
Unfortunately, no. All goodness of fit statistics attempt to summarize goodness of fit into a
single number. A tall order to be sure because goodness of fit might refer to prediction, or ROC,
or likelihood, etc. For different situations, you will weight these topics differently. I often see
people write that there are 5 candidate models and that the AIC is close for the best 2, but looks
somewhat different from the other three. They then will often choose model 2 and justify that
choice as "the second model doesn't differ substantially interns of AIC and has some other
desirable property (like shorter covariate list, or inclusion of some previously substantiated
covariate, etc).

I tend to drop factors when my consulting subject matter expert tells me that it isn't
theoretically justified or that the inclusion of other factors takes the place of the candidate for
exclusion. I like that reason far more than any kind of "automated" or "rule-based" reason.

To be clear, other people might offer a different response here. I am comfortable with my
response because I rarely deviate from the attitude that the model is just a tool that is going to
allow me to draw inference about a population and address the study hypotheses. There are
lots of tools in the toolbox, and frankly, they almost always lead me to the same answer anyway,
so justification of my choice is rarely required.

Subject: Removing Intercept


Participant’s Post:
Under what conditions would we ever want to remove the intercept from the model, using ' -1 '
in the model definition? What is the benefit to not having an intercept calculated?
Instructor’s Response:
There are pedagogical illustrations where I might estimate a model without an intercept just so I
can include a coefficient for all levels of a single classification covariate, but otherwise, I rarely
see anyone do this. The only applications I see are when a researcher "knows" or centers all of
the data so that the overall mean is set to zero.
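For completeness, a tiny R illustration of the '-1' formula term; y, grp, and dat are placeholder names:

# Sketch: removing the intercept so each level of a factor gets its own coefficient.
with_intercept    <- glm(y ~ grp,     family = binomial, data = dat)  # reference-level coding
without_intercept <- glm(y ~ grp - 1, family = binomial, data = dat)  # one coefficient per level

coef(with_intercept)     # intercept plus contrasts against the reference level
coef(without_intercept)  # the log-odds of y = 1 within each level of grp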

Subject: Logistic Regression Modeling Steps


Participant’s Post:
This may seem really basic; however, I like to create lists of steps to be taken when running an analysis. I created the following very basic overview of the major steps I see in modeling binary response variables and would like any advice on incorrect or missing steps. I know it's more complicated, but these are the most basic steps in the process I could identify:

LOGISTIC REGRESSION BASIC MODELLING STEPS

1. Create model

2. Run goodness-of-fit statistics
   a. Pearson Chi2 GOF (p-value > 0.05 indicates a well-fitted model)
   b. Hosmer-Lemeshow statistic (p-value > 0.05 indicates a well-fitted model)

3. Check model dispersion (P__disp)

4. Based on the GOF and dispersion statistics, determine whether:
   a. The model is a good fit; then analyze the model results
   b. The model is not a good fit and there is a need to:
      i. Make adjustments (e.g., use robust SEs)
      ii. Use an alternative PDF (e.g., beta-binomial, Poisson, etc.)

Instructor’s Response:
I hesitate to say that these are "the basics" because in some cases, this is actually more
complete than might be carried out. That said, this is a nice collection of steps that would stand
up to scrutiny of most analyses. I will say that I personally have never carried out the Hosmer-
Lemeshow test. Mostly, because I just don't have much confidence in it. If my client wanted it,
and there was a continuous covariate (or I could verify that there were a really large number of
covariate patterns, then I would run it. Otherwise, I would talk them out of it.

The other comment that I would make is that I tend to bounce between 1 and 4bii as the
covariates of interest are identified. Then, I go through the other items on the list. In building a
model, I often create a table of all possible covariates and run two sample tests for the outcome
to see whether there is an unadjusted association, and to investigate the extent and nature of
missing data over all of the covariates. The bivariate associations and the missing data
investigation might uncover certain estimability problems like quasi- or complete
separation. This is when a covariate doesn't have variability across at least one level of the
covariate. Imagine that you have an outcome success/failure and that you have data on people
for which you know their sex. Now, imagine that some of the males fails and some
succeed. However, imagine that all of the females succeed. In such a case, usual logistic
regression models will fail to estimate a parameter for female.
So, in your create model step 1, add: 1a) calculate bivariate associations (checking for
estimability problems) and 1b) record missingness. Otherwise, it looks good!

Subject: measure dispersion


Participant’s Question: I am starting to get the terms mixed up. What measure of dispersion do I use when? E.g., is the following correct?
* Pearson dispersion statistic for grouped data in a binomial regression
* Sigma for grouped data using beta-binomial regression [is the benchmark value for no dispersion the same as for Pearson (= 1)?]
* Comparing the delta statistics from a binomial to a quasibinomial regression
* Are there other measures of dispersion I should consider?
Instructor’s Response:
Believe me, many statisticians get the terms mixed up, and many books give incorrect information. Extra-dispersion, which usually takes the form of overdispersion, typically relates to count models, but for binomial models we may need to be concerned with extra-dispersion in grouped binomial models.
There is a difference between a dispersion statistic and a dispersion parameter. A dispersion statistic indicates possible or likely extra-dispersion in the data being modeled by logistic regression (or another Bernoulli-based model). A dispersion parameter belongs to a related model with an extra parameter that adjusts for the extra dispersion in the data that cannot be accommodated by the lower-level model. The Pearson Chi2 statistic divided by the residual degrees of freedom is the dispersion statistic for count models, and is a dispersion statistic for binomial models. The beta binomial has an extra parameter that attempts to adjust for the extra dispersion in the data. Remember, it's the data that are extra-dispersed - i.e., either overdispersed or underdispersed.
In count models the base model is the Poisson model, whose mean and variance are identical. When the variance exceeds the mean, this indicates that there is extra dispersion, or correlation, in the data that cannot be modeled properly by the Poisson model without bias. This is indicated by the dispersion statistic. The negative binomial model has an extra parameter that attempts to adjust for the overdispersion in the data that cannot be modeled properly by Poisson regression. There are also three-parameter count models that have yet another parameter to adjust for extra-dispersion that cannot be properly modeled by the negative binomial or other two-parameter count models.
If the grouped logistic dispersion statistic is 1, there is no extra dispersion in the data. If it is over 1, then a beta binomial or generalized binomial model may be needed to account for, or model, the extra dispersion in the data. The higher the dispersion statistic, the greater the value of sigma. Theoretically, if a grouped logistic model's dispersion statistic is 1, the value of the BB dispersion parameter approximates 0.
A quasibinomial model is a name used only in R. It is the same thing as scaling the standard errors by the dispersion statistic. To get quasibinomial SEs (the coefficients are identical), simply multiply the model SEs by the square root of the dispersion statistic. A much better way to adjust for overdispersion while using the logistic model itself is to employ robust or sandwich SEs. That is what is recommended by all major statisticians in the area. Bootstrapped SEs are also well accepted, but sandwich SEs are preferred. I show how to get these in the book, but R does a crummy job of providing this for researchers. I show how all of these statistics and adjustments are related in detail, together with R code, in my book: Hilbe and Robinson (2013), Methods of Statistical Model Estimation, Chapman & Hall/CRC.
I think that the above is all you need to consider for our discussion. What model you use depends on the source of the extra dispersion. If it is caused by the data being formatted in panels (as in longitudinal data), then you should use a fixed effects, random effects, or perhaps mixed effects or hierarchical model. If you think that the model is missing some other information for which you have either specific or even summary statistics, then a Bayesian model will very likely be what you need.



INDEX

A
Adjusted Model SEs · 32
AIC · 21, 22, 41, 42
AIC statistic · 42
ANOVA · 21

B
Bayesian coefficient · 44
BIC statistics · 42
Binary Logistic Model · 45
Box-Tidwell test · 24

C
Canonical form · 4
Chi2 goodness fit using beta-binomial · 47
confidence interval · 23
Confidence Intervals: quantiles · 4
Continuous Variables · 36
Cut Points · 34

D
Deciding Variables in Model · 47
Deviance statistic · 40
dispersion · 50
Dropping a Significant Factor · 49

E
empirical distribution · 26

F
Fracpoly · 25
fracpoly models · 26

G
g values in Hl Algorithm · 11
Grouping Data · 32

L
Likelihood ratio statistic · 24
Likelihood ratio test · 24
Linearity of Slopes Test · 46
Linearity Test pg 89 · 22, 23, 24, 25, 40, 42, 44, 46
link function · 7, 8
Logistic link function · 7
logistic model · 25, 42, 45
logistic regression · 24, 39, 42, 43, 44, 45
logit · 10, 24, 25, 38, 45, 46
Log-Likelihood · 23, 44
Log-Likelihood formation · 27

M
mean · 9, 10, 23, 37, 38, 40, 44, 46
Missing value · 41
Modeling Steps · 49

O
odds ratio · 10, 37, 38, 39
Odds Ratio · 11, 37, 38
Odds Ratio/Standard errors · 7
Over-Dispersion and Under-Dispersion · 30
Over-fitted models · 20

P
pearson dispersion · 46
Percentages in Logistic Analysis · 2
Pg75 Course text P-values · 22
point estimate · 23
posterior distribution · 44
Power and Unbalanced Questions · 1
Predict function · 8
prior probability distribution · 44
probit model · 45
profile likelihood · 22, 23
P-value · 45

Q
quantile regression · 26
quasibinomial · 43
Questions · 19

R
R ‘factor’ value in Logistic Regression · 14
Reference level in variable · 26
Removing Intercept · 49
Residual analysis · 42
Risk ratio and Odds ratio · 5
ROCtest Function in R · 13

S
Scaled and Robust Standard Errors · 13
Some useful sites · 39
standard error · 10, 23
Standard Errors of Odds Ratios · 10
Summary on Logistic Regression · 17

T
Two link functions · 3

W
Wald statistic · 45
Wald test · 24