Tutorial on maximum likelihood estimation
In Jae Myung*
Department of Psychology, Ohio State University, 1885 Neil Avenue Mall, Columbus, OH 43210-1222, USA
Received 30 November 2001; revised 16 October 2002
Abstract
In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares estimation (LSE), which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles.
© 2003 Elsevier Science (USA). All rights reserved.
doi:10.1016/S0022-2496(02)00028-7
I.J. Myung / Journal of Mathematical Psychology 47 (2003) 90–100
In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical work and publish in experimental journals (e.g. Journal of Experimental Psychology) but do modeling. The paper is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For an in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13).

… the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.

Let f(y | w) denote the probability density function (PDF) that specifies the probability of observing data vector y given the parameter w. Throughout this paper we will use a plain letter for a vector (e.g. y) and a letter with a subscript for a vector element (e.g. y_i). The parameter w = (w_1, …, w_k) is a vector defined on a multi-dimensional parameter space. If individual observations, y_i's, are statistically independent of one another, then according to the theory of probability, the PDF for the data y = (y_1, …, y_m) given the parameter vector w can be expressed as a multiplication of PDFs for individual observations:

f(y = (y_1, y_2, …, y_m) | w) = f_1(y_1 | w) f_2(y_2 | w) ⋯ f_m(y_m | w).  (1)
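Eq. (1) can be checked numerically: for statistically independent observations, the joint density is simply the product of the individual densities. The sketch below uses Python with a standard normal PDF purely for illustration (the paper's own examples use the binomial distribution, and its appendix code is in Matlab); the choice of distribution and the sample values are assumptions of this example.

```python
import numpy as np
from scipy.stats import norm

# Three independent observations y_1, y_2, y_3 (illustrative values)
y = np.array([0.5, -1.2, 0.3])

# Joint PDF f(y | w) computed as a product over individual PDFs, as in Eq. (1),
# here with a standard normal density f_i(y_i | w) for each observation
joint = np.prod(norm.pdf(y, loc=0.0, scale=1.0))

# The same product written out term by term
term_by_term = norm.pdf(0.5) * norm.pdf(-1.2) * norm.pdf(0.3)

print(np.isclose(joint, term_by_term))  # prints True: the two computations agree
```

In practice one works with the sum of log-densities rather than the product itself, which is exactly why the log-likelihood is introduced in Section 3.1.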
Fig. 1. Binomial probability distributions of sample size n = 10 and probability parameter w = 0.2 (top) and w = 0.7 (bottom).
f(y | n = 10, w = 0.2) = [10! / (y! (10 − y)!)] 0.2^y 0.8^(10−y),  y = 0, 1, …, 10,  (2)

which is known as the binomial distribution with parameters n = 10, w = 0.2. Note that the number of trials (n) is considered as a parameter. The shape of this PDF is shown in the top panel of Fig. 1. If the parameter value is changed to, say, w = 0.7, a new PDF is obtained as

f(y | n = 10, w = 0.7) = [10! / (y! (10 − y)!)] 0.7^y 0.3^(10−y),  y = 0, 1, …, 10,  (3)

whose shape is shown in the bottom panel of Fig. 1. The following is the general expression of the PDF of the binomial distribution for arbitrary values of w and n:

f(y | n, w) = [n! / (y! (n − y)!)] w^y (1 − w)^(n−y),  0 ≤ w ≤ 1;  y = 0, 1, …, n,  (4)

which as a function of y specifies the probability of data y for a given value of n and w. The collection of all such PDFs generated by varying the parameter across its range (0–1 in this case for w; n ≥ 1) defines a model.

2.2. Likelihood function

Given a set of parameter values, the corresponding PDF will show that some data are more probable than other data. In the previous example, the PDF with w = 0.2, y = 2 is more likely to occur than y = 5 (0.302 vs. 0.026). In reality, however, we have already observed the data. Accordingly, we are faced with an inverse problem: Given the observed data and a model of interest, find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector y and the parameter vector w in f(y | w), i.e.

L(w | y) = f(y | w).  (5)

Thus L(w | y) represents the likelihood of the parameter w given the observed data y, and as such is a function of w. For the one-parameter binomial example in Eq. (4), the likelihood function for y = 7 and n = 10 is given by

L(w | n = 10, y = 7) = f(y = 7 | n = 10, w) = [10! / (7! 3!)] w^7 (1 − w)^3,  0 ≤ w ≤ 1.  (6)

The shape of this likelihood function is shown in Fig. 2. There exists an important difference between the PDF f(y | w) and the likelihood function L(w | y). As illustrated in Figs. 1 and 2, the two functions are defined on different axes, and therefore are not directly comparable to each other. Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood (unnormalized probability) of a particular parameter value for a fixed data set.

Fig. 2. The likelihood function given observed data y = 7 and sample size n = 10 for the one-parameter model described in the text.
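The probabilities quoted above (0.302 vs. 0.026, and the maximized likelihood value 0.267 that reappears in Section 3) can be reproduced directly from Eq. (4). This is a sketch in Python rather than the Matlab of the appendix; `scipy.stats.binom` supplies the binomial PMF.

```python
from scipy.stats import binom

# PDF with n = 10, w = 0.2, as in Eq. (2)
print(round(binom.pmf(2, n=10, p=0.2), 3))  # 0.302: y = 2 is more probable...
print(round(binom.pmf(5, n=10, p=0.2), 3))  # 0.026: ...than y = 5

# Likelihood function of Eq. (6) evaluated at w = 0.7 for the observed y = 7:
# L(w = 0.7 | n = 10, y = 7) = f(y = 7 | n = 10, w = 0.7)
print(round(binom.pmf(7, n=10, p=0.7), 3))  # 0.267
```

Note that the same function call serves both roles: varying the first argument with p fixed traces the PDF of Fig. 1, while varying p with the data fixed at 7 traces the likelihood function of Fig. 2.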
because there is only one parameter besides n, which is assumed to be known. If the model has two parameters, the likelihood function will be a surface sitting above the parameter space. In general, for a model with k parameters, the likelihood function L(w | y) takes the shape of a k-dimensional geometrical surface sitting above a k-dimensional hyperplane spanned by the parameter vector w = (w_1, …, w_k).

3. Maximum likelihood estimation

Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Fig. 1), we are interested in finding the parameter value that corresponds to the desired probability distribution.

The principle of maximum likelihood estimation (MLE), originally developed by R.A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data most likely, which means that one must seek the value of the parameter vector that maximizes the likelihood function L(w | y). The resulting parameter vector, which is sought by searching the multi-dimensional parameter space, is called the MLE estimate, and is denoted by w_MLE = (w_1,MLE, …, w_k,MLE). For example, in Fig. 2, the MLE estimate is w_MLE = 0.7, for which the maximized likelihood value is L(w_MLE = 0.7 | n = 10, y = 7) = 0.267. The probability distribution corresponding to this MLE estimate is shown in the bottom panel of Fig. 1. According to the MLE principle, this is the population that is most likely to have generated the observed data of y = 7. To summarize, maximum likelihood estimation is a method to seek the probability distribution that makes the observed data most likely.

3.1. Likelihood equation

MLE estimates need not exist, nor need they be unique. In this section, we show how to compute MLE estimates when they exist and are unique. For computational convenience, the MLE estimate is obtained by maximizing the log-likelihood function, ln L(w | y). This is because the two functions, ln L(w | y) and L(w | y), are monotonically related to each other, so the same MLE estimate is obtained by maximizing either one. Assuming that the log-likelihood function, ln L(w | y), is differentiable, if w_MLE exists, it must satisfy the following partial differential equation known as the likelihood equation:

∂ ln L(w | y) / ∂w_i = 0  (7)

at w_i = w_i,MLE for all i = 1, …, k. This is because the definition of a maximum or minimum of a continuous differentiable function implies that its first derivatives vanish at such points.

The likelihood equation represents a necessary condition for the existence of an MLE estimate. An additional condition must also be satisfied to ensure that ln L(w | y) is a maximum and not a minimum, since the first derivative cannot reveal this. To be a maximum, the shape of the log-likelihood function should be concave (it must represent a peak, not a valley) in the neighborhood of w_MLE. This can be checked by calculating the second derivatives of the log-likelihood and showing that they are all negative at w_i = w_i,MLE for i = 1, …, k:¹

∂² ln L(w | y) / ∂w_i² < 0.  (8)

¹ Consider the Hessian matrix H(w) defined as H_ij(w) = ∂² ln L(w) / ∂w_i ∂w_j, i, j = 1, …, k. Then a more accurate test of this condition requires that the Hessian H(w) be negative definite, that is, z′ H(w = w_MLE) z < 0 for any k×1 real-valued vector z, where z′ denotes the transpose of z.

To illustrate the MLE procedure, let us again consider the previous one-parameter binomial example given a fixed value of n. First, by taking the logarithm of the likelihood function L(w | n = 10, y = 7) in Eq. (6), we obtain the log-likelihood as

ln L(w | n = 10, y = 7) = ln [10! / (7! 3!)] + 7 ln w + 3 ln(1 − w).  (9)

Next, the first derivative of the log-likelihood is calculated as

d ln L(w | n = 10, y = 7) / dw = 7/w − 3/(1 − w) = (7 − 10w) / (w(1 − w)).  (10)

By requiring this derivative to be zero, the desired MLE estimate is obtained as w_MLE = 0.7. To make sure that the solution represents a maximum, not a minimum, the second derivative of the log-likelihood is calculated and evaluated at w = w_MLE,

d² ln L(w | n = 10, y = 7) / dw² = −7/w² − 3/(1 − w)² = −47.62 < 0  (11)

which is negative, as desired.

In practice, however, it is usually not possible to obtain an analytic form solution for the MLE estimate, especially when the model involves many parameters and its PDF is highly non-linear. In such situations, the MLE estimate must be sought numerically using non-linear optimization algorithms. The basic idea of non-linear optimization is to quickly find optimal parameters that maximize the log-likelihood. This is done by
Fig. 3. A schematic plot of the log-likelihood function for a fictitious one-parameter model. Point B is the global maximum whereas points A and C are two local maxima. The series of arrows depicts an iterative optimization process.
searching much smaller sub-sets of the multi-dimensional parameter space rather than exhaustively searching the whole parameter space, which becomes intractable as the number of parameters increases. The intelligent search proceeds by trial and error over the course of a series of iterative steps. Specifically, on each iteration, by taking into account the results from the previous iteration, a new set of parameter values is obtained by adding small changes to the previous parameters in such a way that the new parameters are likely to lead to improved performance. Different optimization algorithms differ in how this updating routine is conducted. The iterative process, as shown by a series of arrows in Fig. 3, continues until the parameters are judged to have converged (i.e., point B in Fig. 3) on the optimal set of parameters by an appropriately predefined criterion. Examples of the stopping criterion include the maximum number of iterations allowed or the minimum amount of change in parameter values between two successive iterations.

3.2. Local maxima

The algorithm tries to improve upon an initial set of parameters that is supplied by the user. Initial parameter values are chosen either at random or by guessing. Depending upon the choice of the initial parameter values, the algorithm could prematurely stop and return a sub-optimal set of parameter values. This is called the local maxima problem. As an example, in Fig. 3 note that although the starting parameter value at point a2 will lead to the optimal point B, called the global maximum, the starting parameter value at point a1 will lead to point A, which is a sub-optimal solution. Similarly, the starting parameter value at a3 will lead to another sub-optimal solution at point C.

Unfortunately, there exists no general solution to the local maximum problem. Instead, a variety of techniques have been developed in an attempt to avoid the problem, though there is no guarantee of their effectiveness. For example, one may choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly. When that happens, one can conclude with some confidence that a global maximum has been found.²
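The multiple-starting-values strategy just described can be sketched in a few lines. The code below (Python rather than the appendix's Matlab; the optimizer choice `L-BFGS-B` is an assumption of this sketch) numerically maximizes the binomial log-likelihood of Eq. (9) from several random initial values. The one-parameter likelihood here has a single peak, so every run should agree with the analytic solution w_MLE = 0.7, which is exactly the diagnostic suggested above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(w):
    # minus log-likelihood of Eq. (9), dropping the parameter-free ln(10!/(7!3!)) term
    return -(7 * np.log(w[0]) + 3 * np.log(1 - w[0]))

rng = np.random.default_rng(seed=1)
solutions = []
for _ in range(5):  # five runs from different random starting values
    w0 = rng.uniform(0.05, 0.95, size=1)
    res = minimize(neg_log_lik, w0, method="L-BFGS-B", bounds=[(1e-6, 1 - 1e-6)])
    solutions.append(res.x[0])

# all runs should converge to the same solution, the analytic MLE estimate w = 0.7
print(np.round(solutions, 4))
```

When the runs disagree, as can happen for a multi-peaked likelihood like the one in Fig. 3, the largest log-likelihood among them is retained as the candidate global maximum.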
3.3. Relation to least-squares estimation

Recall that in MLE we seek the parameter values that are most likely to have produced the data. In LSE, on the other hand, we seek the parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the square-loss function. Formally, in LSE, the sum of squares error (SSE) between observations and predictions is minimized:

SSE(w) = Σ_{i=1}^{m} (y_i − prd_i(w))²,  (12)

where prd_i(w) denotes the model's prediction for the ith observation. Note that SSE(w) is a function of the parameter vector w = (w_1, …, w_k).

As in MLE, finding the parameter values that minimize SSE generally requires use of a non-linear optimization algorithm. Minimization of SSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters. The choice between the two methods of estimation can have non-trivial consequences. In general, LSE estimates tend to differ from MLE estimates, especially for data that are not normally distributed, such as proportion correct and response time. An implication is that one might possibly arrive at different conclusions about the same data set depending upon which method of estimation is employed in analyzing the data. When this occurs, MLE should be preferred to LSE, unless the probability density function is unknown or difficult to obtain in an easily computable form, for instance, for the diffusion model of recognition memory (Ratcliff, 1978).³ There is a situation, however, in which the two methods intersect. This is when observations are independent of one another and are normally distributed with a constant variance. In this case, maximization of the log-likelihood is equivalent to minimization of SSE, and therefore the same parameter values are obtained under either MLE or LSE.

4. Illustrative example

In this section, I present an application example of maximum likelihood estimation. To illustrate the method, I chose forgetting data, given the recent surge of interest in this topic (e.g. Rubin & Wenzel, 1996; Wickens, 1998; Wixted & Ebbesen, 1991). Among the half-dozen retention functions that have been proposed, the power model and the exponential model are considered here. Let w = (w_1, w_2) denote the parameter vector, t time, and p(w, t) the model's prediction of the probability of correct recall at time t. The two models are defined as

power model:        p(w, t) = w_1 t^(−w_2)       (w_1, w_2 > 0);
exponential model:  p(w, t) = w_1 exp(−w_2 t)    (w_1, w_2 > 0).  (13)

Suppose that data y = (y_1, …, y_m) consists of m observations in which y_i (0 ≤ y_i ≤ 1) represents an observed proportion of correct recall at time t_i (i = 1, …, m). We are interested in testing the viability of these models. We do this by fitting each to observed data and examining its goodness of fit.

Application of MLE requires specification of the PDF f(y | w) of the data under each model. To do this, first we note that each observed proportion y_i is obtained by dividing the number of correct responses x_i by the total number of independent trials n, y_i = x_i / n (0 ≤ y_i ≤ 1). We then note that each x_i is binomially distributed with probability p(w, t_i), so that the PDFs for the power model and the exponential model are obtained as

power:        f(x_i | n, w) = [n! / ((n − x_i)! x_i!)] (w_1 t_i^(−w_2))^{x_i} (1 − w_1 t_i^(−w_2))^{n − x_i},
exponential:  f(x_i | n, w) = [n! / ((n − x_i)! x_i!)] (w_1 exp(−w_2 t_i))^{x_i} (1 − w_1 exp(−w_2 t_i))^{n − x_i},  (14)

where x_i = 0, 1, …, n and i = 1, …, m.

There are two points to be made regarding the PDFs in the above equation. First, it is the probability parameter of a binomial probability distribution (i.e. w in Eq. (4)) that is being modeled. Therefore, the PDF for each model in Eq. (14) is obtained by simply replacing the probability parameter w in Eq. (4) with the model equation, p(w, t), in Eq. (13). Second, note that y_i is related to x_i by a fixed scaling constant, 1/n. As such, any statistical conclusion regarding x_i is applicable directly to y_i, except for the scale transformation. In particular, the PDF for y_i, f(y_i | n, w), is obtained by simply replacing x_i in f(x_i | n, w) with n·y_i.

Now, assuming that the x_i's are statistically independent of one another, the desired log-likelihood function for the power model is given by

ln L(w = (w_1, w_2) | n, x) = ln [ f(x_1 | n, w) f(x_2 | n, w) ⋯ f(x_m | n, w) ]
  = Σ_{i=1}^{m} [ x_i ln(w_1 t_i^(−w_2)) + (n − x_i) ln(1 − w_1 t_i^(−w_2)) + ln n! − ln(n − x_i)! − ln x_i! ].  (15)
Fig. 4. Modeling forgetting data. Squares represent the data in Murdock (1961). The thick (respectively, thin) curves are best fits by the power (respectively, exponential) models.
Table 1
Summary fits of Murdock (1961) data for the power and exponential models under the maximum likelihood estimation (MLE) method and the least-squares estimation (LSE) method

MLE  LSE

Note: For each model fitted, the first row shows the maximized log-likelihood value for MLE and the minimized sum of squares error value for LSE. Each number in parentheses is the proportion of variance accounted for (i.e. r²) in that case. The second and third rows show MLE and LSE parameter estimates for each of w_1 and w_2. The above results were obtained using the Matlab code described in the appendix.
This quantity is to be maximized with respect to the two parameters, w_1 and w_2. It is worth noting that the last three terms of the final expression in the above equation (i.e., ln n! − ln(n − x_i)! − ln x_i!) do not depend upon the parameter vector, and thereby do not affect the MLE results. Accordingly, these terms can be ignored, and their values are often omitted in the calculation of the log-likelihood. Similarly, for the exponential model, its log-likelihood function can be obtained from Eq. (15) by substituting w_1 exp(−w_2 t_i) for w_1 t_i^(−w_2).

In illustrating MLE, I used a data set from Murdock (1961). In this experiment subjects were presented with a set of words or letters and were asked to recall the items after six different retention intervals, (t_1, …, t_6) = (1, 3, 6, 9, 12, 18) in seconds, and thus m = 6. The proportion recall at each retention interval was calculated based on 100 independent trials (i.e. n = 100) to yield the observed data y = (y_1, …, y_6) = (0.94, 0.77, 0.40, 0.26, 0.24, 0.16), from which the number of correct responses, x_i, is obtained as 100·y_i, i = 1, …, 6. In Fig. 4, the proportion recall data are shown as squares. The curves in Fig. 4 are best fits obtained under MLE. Table 1 summarizes the MLE results, including fit measures and parameter estimates, and also includes the LSE results for comparison. Matlab code used for the calculations is included in the appendix.

The results in Table 1 indicate that under either method of estimation, the exponential model fit better than the power model. That is, for the former, the log-likelihood was larger and the SSE smaller than for the latter. The same conclusion can be drawn even in terms of r². Also note the appreciable discrepancies in parameter estimates between MLE and LSE. These differences are not unexpected and are due to the fact
that the proportion data are binomially distributed, not normally distributed. Further, the constant variance assumption required for the equivalence between MLE and LSE does not hold for binomial data, for which the variance, s² = np(1 − p), depends upon the proportion correct p. For issues in model selection, the reader is referred elsewhere (e.g. Linhart & Zucchini, 1986; Myung, Forster, & Browne, 2000; Pitt, Myung, & Zhang, 2002).
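The analysis of this section can be sketched end-to-end. The code below is a Python analogue of the appendix's Matlab program (the function names, the optimizer, and the multi-start loop are assumptions of this sketch, not the paper's own code): it fits the power and exponential models of Eq. (13) to the Murdock (1961) proportions by maximizing the binomial log-likelihood of Eq. (15), using several random starting values to guard against local maxima.

```python
import numpy as np
from scipy.optimize import minimize

t = np.array([1.0, 3.0, 6.0, 9.0, 12.0, 18.0])       # retention intervals (s)
y = np.array([0.94, 0.77, 0.40, 0.26, 0.24, 0.16])   # observed proportions of recall
n = 100                                              # independent trials per interval
x = n * y                                            # numbers of correct responses

def neg_log_lik(w, model):
    # minus binomial log-likelihood of Eq. (15), omitting the parameter-free
    # ln n! - ln(n - x_i)! - ln x_i! terms, which do not affect the MLE estimate
    w1, w2 = w
    p = w1 * t ** (-w2) if model == "power" else w1 * np.exp(-w2 * t)
    p = np.clip(p, 1e-10, 1 - 1e-10)  # keep predicted probabilities inside (0, 1)
    return -np.sum(x * np.log(p) + (n - x) * np.log(1 - p))

def fit(model, n_starts=20):
    # multi-start optimization to reduce the risk of stopping at a local maximum
    rng = np.random.default_rng(seed=0)
    best = None
    for _ in range(n_starts):
        res = minimize(neg_log_lik, rng.uniform(0.1, 0.9, size=2), args=(model,),
                       method="L-BFGS-B", bounds=[(1e-6, 1.5), (1e-6, 2.0)])
        if best is None or res.fun < best.fun:
            best = res
    return best

pow_fit = fit("power")
exp_fit = fit("exponential")

# the exponential model attains the higher log-likelihood, as reported in Table 1
print("power  w =", np.round(pow_fit.x, 3), " log-lik =", round(-pow_fit.fun, 2))
print("expo   w =", np.round(exp_fit.x, 3), " log-lik =", round(-exp_fit.fun, 2))
```

Because the constant terms are dropped, the printed log-likelihood values are comparable between the two models but not to values computed with the constants included. The LSE analysis would replace `neg_log_lik` with the SSE of Eq. (12).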
5. Concluding remarks
Appendix
This appendix presents Matlab code that performs MLE and LSE analyses for the example described in the text.
while 1,
   [w1, lik1, exit1] = ...
      fmincon('power_mle', init_w, [], [], [], [], low_w, up_w, [], opts);
   % optimization for power model that minimizes minus log-likelihood (note that
   % minimization of minus log-likelihood is equivalent to maximization of log-likelihood)
   % w1:    MLE parameter estimates
   % lik1:  maximized log-likelihood value
   % exit1: optimization has converged if exit1 > 0, or not otherwise
   [w2, lik2, exit2] = ...
      fmincon('expo_mle', init_w, [], [], [], [], low_w, up_w, [], opts);
   % optimization for exponential model that minimizes minus log-likelihood
   prd1 = w1(1,1)*t.^(-w1(2,1));                         % best fit prediction by power model
   r2(1,1) = 1 - sum((prd1-y).^2)/sum((y-mean(y)).^2);   % r^2 for power model
   prd2 = w2(1,1)*exp(-w2(2,1)*t);                       % best fit prediction by exponential model
   r2(2,1) = 1 - sum((prd2-y).^2)/sum((y-mean(y)).^2);   % r^2 for exponential model
   if sum(r2 > 0) == 2
      break;
   else
      init_w = rand(2,1);                                % try new random starting values
   end;
end;
format long;
disp(num2str([w1 w2 r2], 5));                            % display parameter estimates and r^2 values
disp(num2str([lik1 lik2 exit1 exit2], 5));               % display log-likelihoods and exit flags
% (The LSE version of the program is analogous, displaying the minimized sum of
% squares error values sse1 and sse2 in place of lik1 and lik2.)
end  % end of the main program
References

Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734–760.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Spanos, A. (1999). Probability theory and statistical inference. Cambridge, UK: Cambridge University Press.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550–592.
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin & Review, 7(3), 424–465.
Wickens, T. D. (1998). On the form of the retention function: Comment on Rubin and Wenzel (1996): A quantitative description of retention. Psychological Review, 105, 379–386.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2, 409–415.