
Technical Details about Maximum Likelihood Estimation for Logistic Regression

Predictive models are made up of several things: an output variable that we want to predict, a set of input
variables that give us the information set for making this prediction, and a mathematical model that relates
these inputs to the output. Embedded within the mathematical model are parameters that govern its
behavior, but usually these parameters are unknown. The process of using data to make inferences about
these parameters is known as estimation or training. For example, we could try to predict home values using
the square footage of a home in a linear regression model. The output variable is the home value, the input
variable is the square footage of the home, and the mathematical model is (home value) = a + b * (square
footage of the home), where a and b are parameters. In general we have data about a sample of homes, and
from this sample we want to guess (or make an inference about) reasonable values of these parameters.
Moreover, we also want to know how good these guesses are as we generalize to new homes (or
observations) that are not in our dataset.
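To make the estimation idea concrete, here is a minimal Python sketch that fits a and b by ordinary least squares; the square footages and prices are made-up numbers used purely to illustrate the mechanics, not real data.

```python
import numpy as np

# Hypothetical data: square footage and sale price (in $1,000s) for five homes.
sqft = np.array([1200.0, 1500.0, 1800.0, 2100.0, 2400.0])
price = np.array([210.0, 255.0, 298.0, 338.0, 395.0])

# Fit (home value) = a + b * (square footage) by least squares.
b, a = np.polyfit(sqft, price, deg=1)  # np.polyfit returns the highest-degree coefficient first
print(f"a (intercept) = {a:.2f}, b (slope per sq ft) = {b:.4f}")

# Generalize to a new home that is not in the sample.
print("predicted value for a 2,000 sq ft home:", round(a + b * 2000.0, 1))
```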

The purpose of this note is to explain how we can estimate a logistic regression model using a procedure
known as maximum likelihood estimation. This is a very general approach to estimation and can be used to
estimate many different types of models. The note also explains the properties of these estimates, such as
the standard errors, z-values, and p-values reported by statistical packages like R.

Logistic Regression Model

First let’s define our notation:

p      Probability that an event occurs
µ      Weighted average of the inputs (the score)
y_i    Output value for observation i, either 1 if true or 0 if false
x_ij   Input value of variable j for observation i
i      Index that goes from 1 to N
β_j    Parameter associated with variable j
N      Number of observations
ln()   Natural logarithm function
exp{}  Exponential function, exp{2.5} = e^{2.5}

Our logistic regression can be formulated in two forms. The first relates the log odds ratio to a score:

\ln\left( \frac{p}{1-p} \right) = \mu \qquad (1)

Where our score is a weighted average or linear combination of our input variables:

\mu = \beta_0 + \beta_1 x_{i1} \qquad (2)

For notational convenience we can think of the input variable associated with \beta_0 as x_{i0}, where j = 0 and this
input variable is always one, x_{i0} = 1. The second form inverts this relationship and expresses the probability
as a function of the weighted score:

p = \frac{\exp\{\mu\}}{1 + \exp\{\mu\}} \qquad (3)

This is equivalent to the following form:

p = \frac{1}{1 + \exp\{-\mu\}} \qquad (4)
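As a quick numerical check (the score value 0.75 is arbitrary), the following Python sketch shows that forms (3) and (4) give the same probability and that the log odds in (1) recover the score:

```python
import numpy as np

mu = 0.75  # an arbitrary score (weighted combination of inputs)

p_form3 = np.exp(mu) / (1.0 + np.exp(mu))  # equation (3)
p_form4 = 1.0 / (1.0 + np.exp(-mu))        # equation (4)
print(p_form3, p_form4)                    # identical up to floating point

log_odds = np.log(p_form4 / (1.0 - p_form4))
print(log_odds)                            # recovers mu, as in equation (1)
```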

One helpful property of the logistic function is:

1 - F(z) = \frac{\exp\{-z\}}{1 + \exp\{-z\}} = F(-z), \quad \text{where } F(z) = \frac{1}{1 + \exp\{-z\}} \qquad (5)

Using calculus we can also derive the first derivative:

\begin{aligned}
F(z) &= (1 + \exp\{-z\})^{-1} \\
F'(z) &= (1 + \exp\{-z\})^{-2} \exp\{-z\} \\
F'(z) &= \frac{1}{1 + \exp\{-z\}} \cdot \frac{\exp\{-z\}}{1 + \exp\{-z\}} \\
F'(z) &= F(z)\,(1 - F(z))
\end{aligned} \qquad (6)
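A short finite-difference check of (6); the evaluation point z = 0.3 and the step size are arbitrary choices:

```python
import numpy as np

def F(z):
    """Logistic function F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.3, 1e-6
numeric = (F(z + h) - F(z - h)) / (2.0 * h)  # central finite difference
analytic = F(z) * (1.0 - F(z))               # equation (6)
print(numeric, analytic)                     # the two agree to high precision
```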

We apply our logistic regression model to a dataset to form a prediction for every observation, which we
denote by the subscript i, giving the predicted probability for each observation:

p_i = P(y_i = 1) = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots)\}} \qquad (7)

The core of most statistical analyses is to define and compute the likelihood function. The likelihood is the
probability that we observe the dataset:

P(y_1, y_2, \ldots, y_N) = P(y_1, y_2, \ldots, y_N; \beta_0, \beta_1) = L(\beta_0, \beta_1) \qquad (8)

Notice that there is an important distinction in interpreting the likelihood versus the probability distribution:
for the likelihood function we assume that the data are known but the parameters are unknown, which is the
reverse of the interpretation of the probability distribution. This interpretation is critical for us since we want
to use the likelihood to make an inference about the parameters. In the training or estimation task we observe
the data but not the parameters, and we want to figure out what reasonable parameter values are. The
likelihood function allows us to understand which parameter values are likely or consistent with our observed
dataset, and which parameter values are unlikely or inconsistent with it.

Most frequently we assume that all the observations are independent, which is the assumption that we make
in our logistic regression problem. Under this assumption we can write our likelihood function for the
logistic regression as:

L(\beta_0, \beta_1) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i} \qquad (9)

In our dataset y_i is a binary variable, either 0 or 1. When y_i = 1 the contribution of observation i is the
probability that the event occurs, p_i, and when y_i = 0 the contribution is the probability that the event does
not occur, 1 - p_i. Notice that the probabilities are functions of the data and the parameters.

Since probabilities fall between 0 and 1, our likelihood will typically be a very small number. An equivalent
approach is to work with the log of the likelihood, which re-expresses (9) as:

\ln(L(\beta_0, \beta_1)) = \sum_{i=1}^{N} \left\{ y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i) \right\} \qquad (10)
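The sketch below evaluates (9) and (10) on a tiny invented dataset with a single input variable; both the data and the candidate parameter values are hypothetical, chosen only to show that exponentiating the log-likelihood reproduces the likelihood.

```python
import numpy as np

# Hypothetical data: one input variable and a binary outcome for N = 6 observations.
x = np.array([0.5, 1.2, -0.3, 2.0, 0.0, -1.5])
y = np.array([1,   0,    0,   1,   1,    0])
beta0, beta1 = -0.2, 1.1       # arbitrary candidate parameter values

mu = beta0 + beta1 * x         # scores, equation (2)
p = 1.0 / (1.0 + np.exp(-mu))  # probabilities, equation (7)

likelihood = np.prod(p**y * (1.0 - p)**(1 - y))              # equation (9)
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p))  # equation (10)
print(likelihood, np.exp(log_lik))  # the two numbers agree
```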

From calculus we can find the maximum by finding the parameter values where the first derivative is equal to
zero. Although we cannot find this point analytically, we can use numerical algorithms like the Newton-
Raphson method to find the best values by iteratively evaluating the function, approximating it locally, and
moving a little in the direction where the log-likelihood improves.

To implement the Newton-Raphson technique we need the first derivative of the log-likelihood function
with respect to the parameters:

\frac{\partial \ln(L)}{\partial \beta_j} = \sum_{i=1}^{N} (y_i - p_i)\, x_{ij} \qquad (11)

The vector of these first derivatives is known as the gradient:

\nabla = \begin{bmatrix} \dfrac{\partial \ln(L)}{\partial \beta_0} \\[1ex] \dfrac{\partial \ln(L)}{\partial \beta_1} \end{bmatrix} \qquad (12)

After some more calculus we can find the second derivative of the log-likelihood function with respect to the
parameters:

\frac{\partial^2 \ln(L)}{\partial \beta_j \, \partial \beta_k} = \sum_{i=1}^{N} \left( -p_i (1 - p_i)\, x_{ij} x_{ik} \right) \qquad (13)

The matrix of these second derivatives is known as the Hessian:

H = \begin{bmatrix} \dfrac{\partial^2 \ln(L)}{\partial \beta_0 \, \partial \beta_0} & \dfrac{\partial^2 \ln(L)}{\partial \beta_0 \, \partial \beta_1} \\[1ex] \dfrac{\partial^2 \ln(L)}{\partial \beta_1 \, \partial \beta_0} & \dfrac{\partial^2 \ln(L)}{\partial \beta_1 \, \partial \beta_1} \end{bmatrix} \qquad (14)
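Putting the gradient (11)-(12) and Hessian (13)-(14) together, here is a compact Newton-Raphson sketch for the two-parameter case. The data are the same kind of invented example as above, and the iteration cap and convergence tolerance are arbitrary choices; this is an illustration, not a production-grade fitting routine.

```python
import numpy as np

# Hypothetical data: a column of ones (x_i0 = 1) plus one input variable.
x = np.array([0.5, 1.2, -0.3, 2.0, 0.0, -1.5])
y = np.array([1,   0,    0,   1,   1,    0])
X = np.column_stack([np.ones_like(x), x])            # columns are x_i0 and x_i1

beta = np.zeros(2)                                   # starting values beta_0 = beta_1 = 0
for _ in range(25):                                  # iteration cap (arbitrary)
    p = 1.0 / (1.0 + np.exp(-X @ beta))              # probabilities, equation (7)
    gradient = X.T @ (y - p)                         # equations (11)-(12)
    hessian = -(X * (p * (1.0 - p))[:, None]).T @ X  # equations (13)-(14)
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                               # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-8:                  # convergence tolerance (arbitrary)
        break

print("maximum likelihood estimates:", beta)
```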

Maximum Likelihood Estimation

The motivation behind maximum likelihood estimation is to choose the parameter estimates that maximize
the likelihood function (or, equivalently, the log-likelihood function). We call our best guess of the
unknown parameters the maximum likelihood estimates (MLE):

\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \, \ln(L(\boldsymbol{\beta})) \qquad (15)

Theoretically we can show some important properties of these estimates; namely, in large samples,

\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}^{*}, \Delta) \qquad (16)

In this equation β* represents the “true”, unknown value and ∆ is the covariance matrix of these estimates.
The covariance is the inverse of the negative of the expected value of the Hessian matrix:

∆ ∆12 
∆ = ( − E [ H ]) =  11
−1
(17)
 ∆ 21 ∆ 22 

Marginally, the maximum likelihood estimate of each parameter has a normal distribution:

\hat{\beta}_i \sim N(\beta_i^{*}, \Delta_{ii}) \qquad (18)

The standard errors of the parameter estimates are the square roots of the diagonal terms of the variance-
covariance matrix, \sqrt{\Delta_{ii}}.
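Continuing the hypothetical Newton-Raphson sketch, the Hessian evaluated at the estimates gives the covariance matrix \Delta of (17) and the standard errors. For the logistic model the Hessian in (13) does not depend on y, so the observed and expected Hessian coincide here.

```python
import numpy as np

# Same hypothetical data as in the Newton-Raphson sketch above.
x = np.array([0.5, 1.2, -0.3, 2.0, 0.0, -1.5])
y = np.array([1,   0,    0,   1,   1,    0])
X = np.column_stack([np.ones_like(x), x])

# Refit by Newton-Raphson as before.
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = -(X * (p * (1.0 - p))[:, None]).T @ X   # equations (13)-(14)
    beta = beta - np.linalg.solve(hessian, X.T @ (y - p))

# Evaluate the Hessian at the final estimates and invert its negative.
p = 1.0 / (1.0 + np.exp(-X @ beta))
hessian = -(X * (p * (1.0 - p))[:, None]).T @ X
cov = np.linalg.inv(-hessian)        # equation (17): Delta = (-H)^{-1}
std_err = np.sqrt(np.diag(cov))      # standard errors, sqrt(Delta_ii)
print("estimates:      ", beta)
print("standard errors:", std_err)
```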

It is quite common to create a hypothesis test under the assumption that the parameter is equal to 0 (e.g.,
H_0: \beta_i^* = 0). The test statistic for this hypothesis is known as a z-statistic, and under the null hypothesis it
has a standard normal distribution:

z = \frac{\hat{\beta}_i}{\sqrt{\Delta_{ii}}} \sim N(0, 1) \qquad (19)

We can then compute the p-value associated with the z-statistic:

p = \Pr(|Z| > |z|) \qquad (20)

The interpretation of this p-value is: given that the null hypothesis is true (e.g., H_0: \beta_i^* = 0), it tells us the
chance of observing an even more extreme value of the test statistic. Notice that it does not test whether
our model is correct, since it assumes that the null hypothesis is true. It is meant to indicate which values
would be unlikely to happen under the null, that is, which are "significant" effects that should be investigated further.
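Finally, a small sketch of how (19) and (20) turn an estimate and its standard error into a z-statistic and a two-sided p-value; the estimate and standard error plugged in here are placeholder numbers, not results from an actual fit.

```python
from scipy.stats import norm

beta_hat = 1.3   # placeholder parameter estimate (not a computed result)
std_err = 0.6    # placeholder standard error

z = beta_hat / std_err           # equation (19)
p_value = 2.0 * norm.sf(abs(z))  # equation (20): Pr(|Z| > |z|), two-sided
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```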
