Discussion (Week 4): MLE and method of moments

Jongbin Jung, Camelia Simoiu, Jerry Lin

Setup
During a long vacation to some foreign country, you’ve been taking the same bus every
morning. Since you’re on vacation, the “morning” has been starting at pretty random
times for you, but you've been told that the buses in this country arrive at constant
intervals throughout the day.


During your stay so far, the times you’ve waited for a bus (in minutes) have been:

16, 2, 6, 7, 19

We wish to know the constant interval at which buses arrive.

A model
Let's call this interval θ. One way to model the historically observed wait times, X_1, X_2, …, X_5, is to treat them as independent and identically distributed draws from a Uniform(0, θ) distribution. This does require, however, that you're willing to make at least two assumptions:

1. your arrival time at bus stops in the mornings has been pretty random
2. the bus you take does indeed arrive every θ minutes.

While not everyone might be on board with these assumptions, let’s just agree to take
them as given, for the purpose of illustration [1].

Similar to last week's example (../week_2/week2_notes.html), there are a few natural and intuitive estimators for θ, given the observed data. But this week, we want to take a more principled approach to finding estimators, so that we have a few ways to construct estimators in situations where obvious/intuitive ones aren't available. The two methods we look at are the method of moments and maximum likelihood estimators.

Method of moments estimator


If we assume that our observed wait times X_i follow a Uniform(0, θ) distribution, then we know from the properties of a Uniform distribution (https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)) that the first moment of X_i given the fixed parameter θ, i.e., E_θ(X_i), is θ/2. But remember, this is a theoretical value, and not something we can compute from the observed data. With the data, we can estimate the j-th moment of X_i via

$$\frac{1}{n}\sum_{i=1}^{n} X_i^{\,j}$$

or for the first moment, simply the sample mean:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$


Given both the theoretical and estimated moments, one way to estimate the parameter θ would be to set it such that our estimated moment is exactly the theoretical moment [2]. In other words, we wish to set our estimator θ̃ such that

$$\frac{\tilde{\theta}}{2} = \bar{X}_n \quad\Longrightarrow\quad \tilde{\theta} = 2\bar{X}_n.$$

In the case of our example, we would compute

$$\tilde{\theta} = 2\left(\frac{16 + 2 + 6 + 7 + 19}{5}\right) = 20.$$
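
As a quick check, we can compute this in R. Here we assume the observed wait times are stored in a vector named example_times (the same name used by the plotting code further below):

example_times <- c(16, 2, 6, 7, 19)  # observed wait times, in minutes

# Method of moments estimate: twice the sample mean
2 * mean(example_times)
# = 20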

Maximum likelihood estimator


Another way to construct an estimator is to start from the joint distribution of the observed data, f(x_1, x_2, …, x_n; θ). In the case of X_i ∼ Uniform(0, θ), we have

$$f(x_i; \theta) = \begin{cases} 1/\theta & \text{if } 0 \le x_i \le \theta \\ 0 & \text{otherwise} \end{cases}$$

and, since the X_i are all independent,

$$f(x_1, x_2, \dots, x_n; \theta) = \begin{cases} (1/\theta)^n & \text{if } 0 \le x_i \le \theta \ \text{ for all } i \\ 0 & \text{otherwise} \end{cases}$$

Recall that in this joint density, the parameter θ is fixed, and f is a function of the observable data X_1, …, X_n. But in reality, we've observed draws of each X_i, and would like to know how likely our observed data would have been under different values of θ. We can represent this likelihood by thinking of the joint density of the data given θ as a function of θ, where the data X_i are now fixed to the values we've observed:

$$L_n(\theta) = \begin{cases} (1/\theta)^n & \text{if } \theta \ge x_i \ \text{ for all } i \\ 0 & \text{otherwise} \end{cases}$$

Then, a natural way to estimate θ is to find the value of θ that makes this likelihood as large as possible, a.k.a. the maximum likelihood estimator.

One common technique for finding the value of a parameter that maximizes the
likelihood function is to take the log of the likelihood, often referred to as the log-
likelihood. In this case, that would be:

$$\ell_n(\theta) = \log L_n(\theta) = \begin{cases} -n \log(\theta) & \text{if } \theta \ge x_i \ \text{ for all } i \\ -\infty & \text{otherwise} \end{cases}$$


To find the value of θ that maximizes ℓ(θ), we make two observations:

1. θ cannot be smaller than any of the observed values [3], otherwise ℓ(θ) would be −∞!
2. −n log(θ) is a monotonically decreasing function (https://en.wikipedia.org/wiki/Monotonic_function) of θ, which means we want the smallest possible value of θ in order to maximize ℓ(θ).

Given these observations, the value θ̂ that maximizes ℓ(θ) is max(X_1, X_2, …, X_n), or in our case, θ̂ = max(16, 2, 6, 7, 19) = 19.
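
As a quick sanity check in R (again assuming the observed wait times are stored in example_times):

# Maximum likelihood estimate: the largest observed wait time
max(example_times)
# = 19
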
We can also use R to visualize the log-likelihood function ℓ(θ) (and its maximum) as a function of θ, given our observed data. First, we implement ℓ(θ) as a function in R [4]:

loglikelihood <- function(theta, X) {
  n <- length(X)
  # The log-likelihood is -n * log(theta) whenever theta >= max(X), and -Inf otherwise
  cond <- max(X) <= theta
  ifelse(cond, -n * log(theta), -Inf)
}

Then, for the purpose of plotting, we create a data frame with a range of possible values for θ and the corresponding log-likelihood:

library(tidyverse)

likelihood_df <- tibble(theta = 10:40) %>%
  mutate(l = loglikelihood(theta, X = example_times)) %>%
  filter(l != -Inf)

ggplot(likelihood_df, aes(x = theta, y = l)) +
  geom_vline(xintercept = example_times, linetype = "dashed") +
  geom_line() +
  geom_point(data = function(d) top_n(d, 1, l), size = 4) +
  scale_x_continuous(expression(theta)) +
  scale_y_continuous("Log-likelihood\n")


If the two estimators we find look familiar, that’s because they are the estimates we
present in assignment #2 (https://5harad.com/mse125/#hw2).

Exercise
Now that we’ve seen how to find both MLE and method of moments estimators, let’s
practice with a different distribution.

A sample of 3 independent observations (X_1 = 0.4, X_2 = 0.7, X_3 = 0.9) is collected from a continuous distribution with density function:

$$f(x; \theta) = \theta x^{\theta - 1},$$

where 0 < x < 1. We would like to estimate the unknown parameter θ.

Let’s use the two methods we’ve covered above to find estimators of θ .

Method of moments estimator



Recall that the method of moments involves

1. Analytically finding the theoretical moments of the data


2. Computing the observed moments of the data


3. Setting the estimator such that the observed moments are equal to the
theoretical moments

Finding estimators for j unknown parameters would require j equations, and hence we would need to find up to the j-th moment. For this exercise, since we have just one unknown parameter θ, we only need to find one equation involving the first moment.
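
As a sketch of how these steps might play out here (one possible route, not necessarily the only write-up): the theoretical first moment of this density is

$$E_\theta(X) = \int_0^1 x \cdot \theta x^{\theta - 1}\, dx = \frac{\theta}{\theta + 1},$$

and the observed first moment is the sample mean X̄ = (0.4 + 0.7 + 0.9)/3 = 2/3. Setting the two equal and solving for the estimator gives

$$\frac{\tilde{\theta}}{\tilde{\theta} + 1} = \bar{X} \quad\Longrightarrow\quad \tilde{\theta} = \frac{\bar{X}}{1 - \bar{X}} = \frac{2/3}{1/3} = 2.$$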

Maximum likelihood estimator



As we have done in the above example, an MLE can be found in two steps:

1. Find the log-likelihood as a function of θ, starting from the joint distribution of the data, f(X_1, X_2, …, X_n; θ).
2. Given the observed data, find the value of θ that maximizes the likelihood.
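
As a sketch of one way to carry out these two steps for this density: the log-likelihood of the sample is

$$\ell_n(\theta) = \log \prod_{i=1}^{n} \theta x_i^{\theta - 1} = n \log \theta + (\theta - 1) \sum_{i=1}^{n} \log x_i.$$

Setting the derivative with respect to θ to zero (and noting that the second derivative, −n/θ², is negative, so this is indeed a maximum) gives

$$\frac{n}{\hat{\theta}} + \sum_{i=1}^{n} \log x_i = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{-n}{\sum_{i=1}^{n} \log x_i},$$

which for the observed sample is θ̂ = −3 / (log 0.4 + log 0.7 + log 0.9) ≈ 2.18.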

1. While it’s quite difficult to come up with real-life examples that fall exactly into
some theoretical distribution, in practice, models based on such simplified
assumptions provide surprisingly useful results.

2. Note that in this case, we only have one unknown, θ, so solving one equation is sufficient; hence we only need to look at the first moment. In cases where there is more than one unknown parameter, we would have to compare the theoretical and estimated values of higher moments as well. More specifically, if we wanted to know both the lower and upper bounds of the Uniform distribution (instead of just assuming the lower bound to be 0), we would need at least the first and second moments of X_i.

3. In our example, if θ were actually smaller than the longest time we waited,
this would imply that we had an unlucky day during which the bus was more
delayed than usual. However, in theory (and under the assumptions of our model),
we assume this doesn’t happen.

4. A proper implementation of the (log-)likelihood (as far as the math is concerned) would actually be a closure (https://en.wikipedia.org/wiki/Closure_(computer_programming)) that takes the data (X) and returns the log-likelihood as a function of the parameter (θ); but to keep things simple here, we're just going to create a function of both the parameter and the data, and trust the user (us!) to be smart enough to keep the data constant. A short sketch of that closure style appears after these footnotes.

5. This is a bit sloppy. In reality, we would also have to check the second derivative of ℓ(θ) to make sure that the θ̂ we find is indeed a maximum.
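
As mentioned in footnote [4], a closure-style implementation would bind the data once and return a log-likelihood that is a function of θ alone. A minimal sketch of that alternative (not used in the notes above; make_loglikelihood is a name chosen here just for illustration):

# Bind the data once; get back a log-likelihood that depends only on theta
make_loglikelihood <- function(X) {
  n <- length(X)
  x_max <- max(X)
  function(theta) {
    ifelse(theta >= x_max, -n * log(theta), -Inf)
  }
}

ll <- make_loglikelihood(example_times)
ll(19)  # same value as loglikelihood(19, example_times)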
