
Difference between logit and probit models

http://stats.stackexchange.com/questions/20523/difference-between-logit-and-probit-models

What is the difference between the logit and the probit model?
I'm more interested here in knowing when to use logistic regression, and when to use
probit. If there is any literature that defines it using R, that would be helpful as well.

The difference lies mainly in the link function.

In logit: Pr(Y = 1 ∣ X) = [1 + e^(−X′β)]^(−1)
In probit: Pr(Y = 1 ∣ X) = Φ(X′β), where Φ is the standard normal CDF.
In other words, the logistic curve has slightly flatter tails, i.e. the probit curve approaches the axes more quickly than the logit curve does.

Logit has a more direct interpretation than probit: logistic regression can be interpreted as
modeling the log odds. Usually people start the modeling with logit. You could compare the
likelihood values to decide between logit and probit.
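(A quick R sketch of both points, the flatter logistic tails and the likelihood comparison, using simulated data rather than data from any particular study:)

# Compare the inverse link functions: logistic CDF (logit) vs. standard normal CDF (probit)
x <- seq(-4, 4, length.out = 200)
plot(x, plogis(x), type = "l", ylab = "Pr(Y = 1)")   # logistic: slightly flatter tails
lines(x, pnorm(x), lty = 2)                          # probit: approaches 0 and 1 faster

# Compare the two links on the same (simulated) data by likelihood
set.seed(2)
z <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(z))
logLik(glm(y ~ z, family = binomial(link = "logit")))
logLik(glm(y ~ z, family = binomial(link = "probit")))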

A standard linear model (e.g., a simple regression model) can be thought of as having two
'parts'. These are called the structural component and the random component. For
example:
Y = β0 + β1X + ε, where ε ∼ N(0, σ²)
The first two terms (that is, β0+β1X) constitute the structural component, and the ε (which
indicates a normally distributed error term) is the random component. When the response
variable is not normally distributed (for example, if your response variable is binary) this
approach may no longer be valid. The generalized linear model (GLiM) was developed to address
such cases, and logit and probit models are special cases of GLiMs that are appropriate for
binary variables (or multi-category response variables with some adaptations to the process). A
GLiM has three parts, a structural component, a link function, and a response distribution. For
example:
g(μ)=β0+β1X
Here β0+β1X is again the structural component, g() is the link function, and μ is the mean of the
conditional response distribution at a given point in the covariate space. The way we think about
the structural component here doesn't really differ from how we think about it with standard
linear models; in fact, that's one of the great advantages of GLiMs. Because for many
distributions the variance is a function of the mean, having fit a conditional mean (and given that
you stipulated a response distribution), you have automatically accounted for the analog of the
random component in a linear model (N.B.: this can be more complicated in practice).
The link function is the key to GLiMs: since the distribution of the response variable is non-
normal, it's what lets us connect the structural component to the response--it 'links' them
(hence the name). It's also the key to your question, since the logit and probit are links (as
@vinux explained), and understanding link functions will allow us to intelligently choose
when to use which one. Although many different link functions can be acceptable,
often there is one that is special. Without wanting to get too far into the weeds (this can get
very technical) the predicted mean, μ, will not necessarily be mathematically the same as
the response distribution's canonical location parameter; the link function that does equate
them is the canonical link function. The advantage of this "is that a minimal sufficient
statistic for β exists" (German Rodriguez). The canonical link for binary response data
(more specifically, the binomial distribution) is the logit. However, there are lots of functions
that can map the structural component onto the interval (0,1), and thus be acceptable; the
probit is also popular, but there are yet other options that are sometimes used (such as the
complementary log log, ln(−ln(1−μ)), often called 'cloglog'). Thus, there are lots of possible
link functions and the choice of link function can be very important. The choice should be
made based on some combination of:
1. Knowledge of the response distribution,
2. Theoretical considerations, and
3. Empirical fit to the data.
Having covered a little of the conceptual background needed to understand these ideas more
clearly (forgive me), I will explain how these considerations can be used to guide your
choice of link. (Let me note that I think @David's comment accurately captures why different
links are chosen in practice.) To start with, if your response variable is the outcome of a
Bernoulli trial (that is, 0 or 1), your response distribution will be binomial, and what you are
actually modeling is the probability of an observation being a 1 (that is, π(Y=1)). As a
result, any function that maps the real number line, (−∞,+∞), to the interval (0,1) will work.
From the point of view of your substantive theory, if you are thinking of your covariates
as directly connected to the probability of success, then you would typically choose logistic
regression because it is the canonical link. However, consider the following example: You
are asked to model high_Blood_Pressure as a function of some covariates. Blood pressure
itself is normally distributed in the population (I don't actually know that, but it seems
reasonable prima facie); nonetheless, clinicians dichotomized it during the study (that is,
they only recorded 'high-BP' or 'normal'). In this case, probit would be preferable a priori for
theoretical reasons. This is what @Elvis meant by "your binary outcome depends on a
hidden Gaussian variable". Another consideration is that both logit and probit
are symmetric; if you believe that the probability of success rises slowly from zero but
then tapers off more quickly as it approaches one, the cloglog is called for, and so on.
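(In R, each of these links is simply an argument to the binomial() family in glm(); a minimal sketch with simulated data standing in for a real study:)

# Hypothetical binary data; in practice y and x would come from your study
set.seed(3)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + x))

# The link is just an argument to the binomial() family in glm()
fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit  <- glm(y ~ x, family = binomial(link = "probit"))
fit_cloglog <- glm(y ~ x, family = binomial(link = "cloglog"))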
Lastly, note that the empirical fit of the model to the data is unlikely to be of assistance in
selecting a link, unless the shapes of the link functions in question differ substantially (which
the logit and probit do not). For instance, consider the following simulation:

set.seed(1)
probLower = vector(length=1000)

for(i in 1:1000){
    x = rnorm(1000)
    y = rbinom(n=1000, size=1, prob=pnorm(x))   # data generated from a probit model

    logitModel  = glm(y~x, family=binomial(link="logit"))
    probitModel = glm(y~x, family=binomial(link="probit"))

    # record whether the probit fit has the lower (better) deviance
    probLower[i] = deviance(probitModel) < deviance(logitModel)
}

sum(probLower)/1000
[1] 0.695
Even when we know the data were generated by a probit model, and we have 1000 data
points, the probit model only yields a better fit 70% of the time, and even then, often by only
a trivial amount. Consider the last iteration:

deviance(probitModel)
[1] 1025.759
deviance(logitModel)
[1] 1026.366
deviance(logitModel)-deviance(probitModel)
[1] 0.6076806
The reason for this is simply that the logit and probit link functions yield very similar outputs
when given the same inputs.

The logit and probit functions are practically identical, except that the logit is slightly further
from the bounds when they 'turn the corner', as @vinux stated. (Note that to get the logit
and the probit to align optimally, the logit's β1 must be ≈1.7 times the corresponding slope
value for the probit. In addition, I could have shifted the cloglog over slightly so that it would
lie on top of the others more, but I left it to the side to keep the figure more
readable.) Notice that the cloglog is asymmetrical whereas the others are not; it starts
pulling away from 0 earlier, but more slowly, and approaches close to 1 and then turns
sharply.
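(The ≈1.7 rescaling is easy to check in R; the constant is approximate, not exact:)

# The logistic CDF with slope ~1.7 nearly coincides with the standard normal CDF
x <- seq(-3, 3, length.out = 100)
max(abs(plogis(1.7 * x) - pnorm(x)))   # maximum discrepancy, roughly 0.01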
A couple more things can be said about link functions. First, considering the identity
function (g(μ)=μ) as a link function allows us to understand the standard linear model as a
special case of the generalized linear model (that is, the response distribution is normal,
and the link is the identity function). It's also important to recognize that whatever
transformation the link instantiates is properly applied to the parameter governing the
response distribution (that is, μ), not the actual response data. Finally, because in practice
we never have the underlying parameter to transform, in discussions of these models, often
what is considered to be the actual link is left implicit and the model is represented by
the inverse of the link function applied to the structural component instead. That is:
μ = g⁻¹(β0 + β1X)
For instance, logistic regression is usually represented:
π(Y) = exp(β0 + β1X) / [1 + exp(β0 + β1X)]
instead of:
ln[π(Y) / (1 − π(Y))] = β0 + β1X
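(In R, plogis() is exactly the inverse logit, so applying it to the fitted linear predictor reproduces predict(..., type = "response"); a minimal sketch with simulated data:)

set.seed(4)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.2 + 0.8 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

# mu = g^{-1}(b0 + b1*x): inverse link applied to the linear predictor
mu_by_hand <- plogis(coef(fit)[1] + coef(fit)[2] * x)
all.equal(unname(mu_by_hand), unname(predict(fit, type = "response")))   # TRUE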
For a quick and clear, but solid, overview of the generalized linear model, see chapter 10
of Fitzmaurice, Laird, & Ware (2004) (on which I leaned for parts of this answer, although
since this is my own adaptation of that--and other--material, any mistakes would be my
own). For how to fit these models in R, check out the documentation for ?glm in
the stats package.
(One final note added later:) I occasionally hear people say that you shouldn't use the
probit, because it can't be interpreted. This is not true, although the interpretation of the
betas is less intuitive. With logistic regression, a one unit change in X1 is associated with
a β1 change in the log odds of 'success' (alternatively, an exp(β1)-fold change in the odds),
all else being equal. With a probit, this would be a change of β1 z's. (Think of two
observations in a dataset with z-scores of 1 and 2, for example.) To convert these into
predicted probabilities, you can pass them through the normal CDF, or look them up on a z-
table.
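(For instance, a small sketch of that conversion in R, with made-up coefficient values just for illustration:)

# Hypothetical probit coefficients: intercept -1, slope 0.5 (in z units)
b0 <- -1; b1 <- 0.5

# Predicted probability of 'success' at X1 = 1 and X1 = 2:
pnorm(b0 + b1 * 1)   # pass the linear predictor through the normal CDF
pnorm(b0 + b1 * 2)   # a one unit increase in X1 shifts the z-score by b1 = 0.5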
(+1 to both @vinux and @Elvis. Here I have tried to provide a broader framework within
which to think about these things and then using that to address the choice between logit
and probit.)

In addition to vinux's answer, which already covers the most important points:

• the coefficients β in the logit regression have natural interpretations in terms of odds
ratios;
• probit regression is the natural model when you think that your binary outcome
depends on a hidden Gaussian variable Z = X′β + ϵ [eq. 1], with ϵ ∼ N(0, 1), in a
deterministic manner: Y = 1 exactly when Z > 0.
• More generally, and more naturally, probit regression is the natural model if
you think that the outcome is 1 exactly when some Z0 = X′β0 + ϵ0 exceeds a threshold c,
with ϵ0 ∼ N(0, σ²). It is easy to see that this can be reduced to the aforementioned case:
just rescale Z0 as Z = (1/σ)(Z0 − c); it's easy to check that equation [eq. 1] still holds
(rescale the coefficients and translate the intercept). These models have been
defended, for example, in medical contexts, where Z0 would be an unobserved
continuous variable, and Y e.g. a disease which appears when Z0 exceeds some
"pathological threshold" (a small sketch of this reduction follows just below).
Both logit and probit models are only models. "All models are wrong, some are useful", as
Box once said! Both models will allow you to detect the existence of an effect of X on the
outcome Y; except in some very special cases, neither of them will be "really true", and
their interpretation should be done with caution.

An important point that has not been addressed in the previous (excellent) answers is the
actual estimation step. Multinomial logit models have a PDF that is easy to integrate,
leading to a closed-form expression of the choice probability. The density function of the
normal distribution is not so easily integrated, so probit models typically require simulation.
So while both models are abstractions of real world situations, logit is usually faster to use
on larger problems (multiple alternatives or large datasets).

To see this more clearly, the probability of a particular outcome being selected is a function
of the x predictor variables and the ε error terms (following Train):
P = ∫ I[ε > −β′x] f(ε) dε
where I is an indicator function, 1 if selected and zero otherwise. Evaluating this integral
depends heavily on the assumed density f(ε): it is logistic in the logit model and normal in the
probit model. For a logit model, this becomes
P = ∫_{−β′x}^{∞} f(ε) dε = 1 − F(−β′x) = 1 − 1/(1 + exp(β′x))
No such convenient form exists for probit models.
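(You can verify the closed form numerically; plogis() is the logistic CDF in R:)

bx <- seq(-3, 3, length.out = 7)            # a few values of the linear predictor β′x
all.equal(1 - 1/(1 + exp(bx)), plogis(bx))  # TRUE: the closed form is the logistic CDF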

Regarding your statement

I'm more interested here in knowing when to use logistic regression, and when to use probit
There are already many answers here that bring up things to consider when choosing
between the two but there is one important consideration that hasn't been stated yet: When
your interest is in looking at within-cluster associations in binary data using mixed
effects logistic or probit models, there is a theoretical grounding for preferring the
probit model. This is, of course, assuming that there is no a priori reason for preferring the
logistic model (e.g. if you're doing a simulation and know it to be the true model).
To see why this is true, first note that both of these models can be viewed as
thresholded continuous regression models. As an example, consider the simple linear mixed
effects model for observation i within cluster j:
y⋆ij = μ + ηj + εij
where ηj ∼ N(0, σ²) is the cluster j random effect and εij is the error term. Then both the
logistic and probit regression models are equivalently formulated as being generated from
this model and thresholding at 0:
yij = 1 if y⋆ij ≥ 0, and yij = 0 if y⋆ij < 0
If the εij term is normally distributed, you have a probit regression; if it is logistically
distributed, you have a logistic regression model. Since the scale is not identified, these
residual errors are specified as standard normal and standard logistic, respectively.
Pearson (1900) showed that if multivariate normal data were generated and
thresholded to be categorical, the correlations between the underlying variables were still
statistically identified - these correlations are termed polychoric correlations and, specific to
the binary case, they are termed tetrachoric correlations. This means that, in a probit model,
the intraclass correlation coefficient of the underlying normally distributed variables:
ICC = σ̂² / (σ̂² + 1)
is identified, which means that in the probit case you can fully characterize the joint
distribution of the underlying latent variables.
In the logistic model, the random effect variance is still identified, but it does not fully
characterize the dependence structure (and therefore the joint distribution), since the latent
variable is then the sum of a normal and a logistic random variable, which does not have the
property of being fully specified by its mean and covariance matrix. This odd parametric
assumption for the underlying latent variables makes interpretation of the random effects in
the logistic model less clear in general.
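(A hedged sketch of what this looks like in R with lme4 -- the data generation and variable names here are invented for illustration; glmer accepts the probit link through the binomial family:)

library(lme4)

# Simulate clustered binary data from the thresholded latent-variable model above
set.seed(6)
n_cluster <- 200; n_per <- 20
cluster   <- rep(seq_len(n_cluster), each = n_per)
eta       <- rnorm(n_cluster, sd = 1)                         # cluster random effects
y_star    <- -0.5 + eta[cluster] + rnorm(n_cluster * n_per)   # latent variable
y         <- as.numeric(y_star >= 0)

fit <- glmer(y ~ 1 + (1 | cluster), family = binomial(link = "probit"))

# ICC of the underlying latent variable: sigma^2 / (sigma^2 + 1)
sigma2 <- as.numeric(VarCorr(fit)$cluster)
sigma2 / (sigma2 + 1)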

What I am going to say in no way invalidates what has been said thus far. I just want to
point out that probit models do not suffer from the IIA (Independence of Irrelevant Alternatives)
assumption, while the logit model does.

To use an example from Train's excellent book: if I have a logit model that predicts whether I am
going to ride the blue bus or drive my car, adding a red bus would draw from both car and
blue bus proportionally. But using a probit model you can avoid this problem. In essence,
instead of drawing from both proportionally, you may draw more from the blue bus, since the
two buses are closer substitutes.

The sacrifice you make is that there are no closed-form solutions, as pointed out above.
Probit tends to be my go-to when I am worried about IIA issues. That's not to say that there
aren't ways to get around IIA in a logit framework (GEV distributions), but I've always
looked at these sorts of models as a clunky way around the problem. With the
computational speeds that you can get these days, I would say go with probit.
