Journal of Mathematical Psychology 47 (2003) 90–100
Tutorial
Tutorial on maximum likelihood estimation
In Jae Myung*
Department of Psychology, Ohio State University, 1885 Neil Avenue Mall, Columbus, OH 43210-1222, USA
Received 30 November 2001; revised 16 October 2002
Abstract
In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares estimation, which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and an indispensable tool for many statistical modeling techniques, in particular non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles.
1. Introduction
In psychological science, we seek to uncover general laws and principles that govern the behavior under investigation. As these laws and principles are not directly observable, they are formulated in terms of hypotheses. In mathematical modeling, such hypotheses about the structure and inner working of the behavioral process of interest are stated in terms of parametric families of probability distributions called models. The goal of modeling is to deduce the form of the underlying process by testing the viability of such models.

Once a model is specified with its parameters, and data have been collected, one is in a position to evaluate its goodness of fit, that is, how well it fits the observed data. Goodness of fit is assessed by finding parameter values of a model that best fit the data, a procedure called parameter estimation.

There are two general methods of parameter estimation: least-squares estimation (LSE) and maximum likelihood estimation (MLE). The former has been a popular choice of model fitting in psychology (but see Usher & McClelland, 2001) and is tied to many familiar statistical concepts such as linear regression, sum of squares error, proportion of variance accounted for (i.e. $r^2$), and root mean squared deviation. LSE, which unlike MLE requires no or minimal distributional assumptions, is useful for obtaining a descriptive measure that summarizes observed data, but it provides no basis for testing hypotheses or constructing confidence intervals.

On the other hand, MLE is not as widely recognized among modelers in psychology, but it is a standard approach to parameter estimation and inference in statistics. MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest is contained in its MLE estimator); consistency (the true parameter value that generated the data is recovered asymptotically, i.e. for data of sufficiently large samples); efficiency (the lowest possible variance of parameter estimates is achieved asymptotically); and parameterization invariance (the same MLE solution is obtained independent of the parametrization used). In contrast, none of these properties holds for LSE in general. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach primarily used with linear regression models. Further, many of the inference methods in statistics are developed based on MLE. For example, MLE is a prerequisite for the chi-square test, the G-square test, Bayesian methods, inference with missing data, modeling of random effects, and many model selection criteria such as the Akaike information criterion (Akaike, 1973) and the Bayesian information criterion (Schwarz, 1978).
*Fax: +614-292-5601. E-mail address: myung.1@osu.edu.

In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical work and publish in experimental journals (e.g. Journal of Experimental Psychology) but do modeling. The paper is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13).
2. Model specification
2.1. Probability density function
From a statistical standpoint, the data vector $y = (y_1, \ldots, y_m)$ is a random sample from an unknown population. The goal of data analysis is to identify the population that is most likely to have generated the sample. In statistics, each population is identified by a corresponding probability distribution. Associated with each probability distribution is a unique value of the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.

Let $f(y \mid w)$ denote the probability density function (PDF) that specifies the probability of observing data vector $y$ given the parameter $w$. Throughout this paper we will use a plain letter for a vector (e.g. $y$) and a letter with a subscript for a vector element (e.g. $y_i$). The parameter $w = (w_1, \ldots, w_k)$ is a vector defined on a multi-dimensional parameter space. If individual observations, $y_i$'s, are statistically independent of one another, then according to the theory of probability, the PDF for the data $y = (y_1, \ldots, y_m)$ given the parameter vector $w$ can be expressed as a multiplication of PDFs for individual observations,

$$f(y = (y_1, y_2, \ldots, y_m) \mid w) = f_1(y_1 \mid w)\, f_2(y_2 \mid w) \cdots f_m(y_m \mid w). \qquad (1)$$
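To see the factorization in Eq. (1) at work, here is a minimal Python sketch (an illustration, not part of the original paper) that computes the joint PDF of independent Bernoulli observations as a product of individual densities; the function names and example data are assumptions made for this example.

```python
# Eq. (1) for independent Bernoulli observations: the joint density of the
# data vector y is the product of the densities of its elements.

def bernoulli_pdf(y_i, w):
    """PDF of a single Bernoulli observation: w if y_i = 1, 1 - w if y_i = 0."""
    return w if y_i == 1 else 1.0 - w

def joint_pdf(y, w):
    """Joint PDF f(y | w) as the product of individual PDFs, per Eq. (1)."""
    density = 1.0
    for y_i in y:
        density *= bernoulli_pdf(y_i, w)
    return density

y = [1, 0, 1, 1, 0]        # an example data vector of m = 5 observations
print(joint_pdf(y, 0.2))   # 0.2**3 * 0.8**2 = 0.00512
```

In practice one works with the logarithm of this product, which turns it into a sum and avoids numerical underflow for large $m$.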
To illustrate the idea of a PDF, consider the simplest case with one observation and one parameter, that is, $m = k = 1$. Suppose that the data $y$ represents the number of successes in a sequence of 10 Bernoulli trials (e.g. tossing a coin 10 times) and that the probability of a success on any one trial, represented by the parameter $w$, is 0.2. The PDF in this case is given by

$$f(y \mid n = 10, w = 0.2) = \frac{10!}{y!\,(10-y)!}\,(0.2)^y (0.8)^{10-y} \quad (y = 0, 1, \ldots, 10) \qquad (2)$$
Fig. 1. Binomial probability distributions of sample size $n = 10$ and probability parameter $w = 0.2$ (top) and $w = 0.7$ (bottom).
which is known as the binomial distribution with parameters $n = 10$ and $w = 0.2$. Note that the number of trials $(n)$ is considered as a parameter. The shape of this PDF is shown in the top panel of Fig. 1. If the parameter value is changed to, say, $w = 0.7$, a new PDF is obtained as

$$f(y \mid n = 10, w = 0.7) = \frac{10!}{y!\,(10-y)!}\,(0.7)^y (0.3)^{10-y} \quad (y = 0, 1, \ldots, 10) \qquad (3)$$

whose shape is shown in the bottom panel of Fig. 1. The following is the general expression of the PDF of the binomial distribution for arbitrary values of $w$ and $n$:

$$f(y \mid n, w) = \frac{n!}{y!\,(n-y)!}\,w^y (1-w)^{n-y} \quad (0 \le w \le 1;\; y = 0, 1, \ldots, n) \qquad (4)$$

which as a function of $y$ specifies the probability of data $y$ for a given value of $n$ and $w$. The collection of all such PDFs generated by varying the parameter across its range (0–1 in this case for $w$; $n \ge 1$) defines a model.
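As a numerical companion to Eqs. (2)-(4), the following Python sketch (an illustration, not part of the paper; the helper name binomial_pdf is assumed) tabulates the binomial PDF for the two parameter values plotted in Fig. 1.

```python
from math import comb

def binomial_pdf(y, n, w):
    """Binomial PDF of Eq. (4): probability of y successes in n trials."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

# Tabulate the two PDFs plotted in Fig. 1 (n = 10, w = 0.2 and w = 0.7).
for w in (0.2, 0.7):
    probs = [binomial_pdf(y, 10, w) for y in range(11)]
    print(f"w = {w}: " + " ".join(f"{p:.3f}" for p in probs))
```

For $w = 0.2$ this reproduces, for instance, $f(2) \approx 0.302$ and $f(5) \approx 0.026$, the values cited in Section 2.2 below.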
2.2. Likelihood function
Given a set of parameter values, the corresponding PDF will show that some data are more probable than other data. In the previous example, under the PDF with $w = 0.2$, the datum $y = 2$ is more likely to occur than $y = 5$ (0.302 vs. 0.026). In reality, however, we have already observed the data. Accordingly, we are faced with an inverse problem: Given the observed data and a model of interest, find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector $y$ and the parameter vector $w$ in $f(y \mid w)$, i.e.

$$L(w \mid y) = f(y \mid w). \qquad (5)$$
Thus $L(w \mid y)$ represents the likelihood of the parameter $w$ given the observed data $y$, and as such is a function of $w$. For the one-parameter binomial example in Eq. (4), the likelihood function for $y = 7$ and $n = 10$ is given by

$$L(w \mid n = 10, y = 7) = f(y = 7 \mid n = 10, w) = \frac{10!}{7!\,3!}\,w^7 (1-w)^3 \quad (0 \le w \le 1). \qquad (6)$$
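As an illustrative check (a sketch, not from the paper), Eq. (6) can be evaluated on a grid of $w$ values; the simple grid search below locates the peak of the likelihood curve shown in Fig. 2.

```python
from math import comb

def likelihood(w, n=10, y=7):
    """Likelihood L(w | n = 10, y = 7) of Eq. (6), viewed as a function of w."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

# Evaluate the likelihood on a grid of w values in [0, 1] and locate its peak.
grid = [i / 1000 for i in range(1001)]
w_best = max(grid, key=likelihood)
print(w_best, likelihood(w_best))   # peak at w = 0.7, i.e. y/n
```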
The shape of this likelihood function is shown in Fig. 2. There exists an important difference between the PDF $f(y \mid w)$ and the likelihood function $L(w \mid y)$. As illustrated in Figs. 1 and 2, the two functions are defined on different axes, and therefore are not directly comparable to each other. Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood ("unnormalized probability") of a particular parameter value for a fixed data set. Note that the likelihood function in this figure is a curve because there is only one parameter, beside $n$, which is assumed to be known.
Fig. 2. The likelihood function given observed data $y = 7$ and sample size $n = 10$ for the one-parameter model described in the text.
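The "unnormalized probability" remark can be verified numerically. The sketch below (again an illustration under the same assumed helper, not from the paper) shows that the binomial PDF sums to 1 over the data scale, whereas the likelihood of Eq. (6) does not integrate to 1 over the parameter scale.

```python
from math import comb

def f(y, n, w):
    """Binomial PDF of Eq. (4)."""
    return comb(n, y) * w**y * (1 - w)**(n - y)

# The PDF sums to 1 over the data scale (y = 0, ..., n) for fixed w ...
print(sum(f(y, 10, 0.2) for y in range(11)))     # 1.0 (up to rounding)

# ... but the likelihood need not integrate to 1 over the parameter scale.
# Riemann approximation of the integral of L(w | n=10, y=7) over [0, 1]:
grid = [i / 1000 for i in range(1001)]
print(sum(f(7, 10, w) for w in grid) / 1000)     # about 0.0909 (= 1/11), not 1
```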