
Journal of Mathematical Psychology 47 (2003) 90–100

Tutorial
Tutorial on maximum likelihood estimation
In Jae Myung
Department of Psychology, Ohio State University, 1885 Neil Avenue Mall, Columbus, OH 43210-1222, USA
Received 30 November 2001; revised 16 October 2002

Abstract

In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares estimation, which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and is an indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of some of the basic principles.
© 2003 Elsevier Science (USA). All rights reserved.

1. Introduction

In psychological science, we seek to uncover general laws and principles that govern the behavior under investigation. As these laws and principles are not directly observable, they are formulated in terms of hypotheses. In mathematical modeling, such hypotheses about the structure and inner working of the behavioral process of interest are stated in terms of parametric families of probability distributions called models. The goal of modeling is to deduce the form of the underlying process by testing the viability of such models.

Once a model is specified with its parameters, and data have been collected, one is in a position to evaluate its goodness of fit, that is, how well it fits the observed data. Goodness of fit is assessed by finding parameter values of a model that best fits the data, a procedure called parameter estimation.

There are two general methods of parameter estimation. They are least-squares estimation (LSE) and maximum likelihood estimation (MLE). The former has been a popular choice of model fitting in psychology (e.g., Rubin, Hinton, & Wenzel, 1999; Lamberts, 2000; but see Usher & McClelland, 2001) and is tied to many familiar statistical concepts such as linear regression, sum of squares error, proportion variance accounted for (i.e. r2), and root mean squared deviation. LSE, which unlike MLE requires no or minimal distributional assumptions, is useful for obtaining a descriptive measure for the purpose of summarizing observed data, but it has no basis for testing hypotheses or constructing confidence intervals.

On the other hand, MLE is not as widely recognized among modelers in psychology, but it is a standard approach to parameter estimation and inference in statistics. MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest contained in its MLE estimator); consistency (the true parameter value that generated the data is recovered asymptotically, i.e. for data of sufficiently large samples); efficiency (the lowest possible variance of parameter estimates is achieved asymptotically); and parameterization invariance (the same MLE solution is obtained independent of the parametrization used). In contrast, no such things can be said about LSE. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach that is primarily used with linear regression models. Further, many of the inference methods in statistics are developed based on MLE. For example, MLE is a prerequisite for the chi-square test, the G-square test, Bayesian methods, inference with missing data, modeling of random effects, and many model selection criteria such as the Akaike information criterion (Akaike, 1973) and the Bayesian information criterion (Schwarz, 1978).


In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical work and publish in experimental journals (e.g. Journal of Experimental Psychology) but do modeling. The paper is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13).

2. Model specification

2.1. Probability density function

From a statistical standpoint, the data vector y = (y1, ..., ym) is a random sample from an unknown population. The goal of data analysis is to identify the population that is most likely to have generated the sample. In statistics, each population is identified by a corresponding probability distribution. Associated with each probability distribution is a unique value of the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.

Let f(y | w) denote the probability density function (PDF) that specifies the probability of observing the data vector y given the parameter w. Throughout this paper we will use a plain letter for a vector (e.g. y) and a letter with a subscript for a vector element (e.g. yi). The parameter w = (w1, ..., wk) is a vector defined on a multi-dimensional parameter space. If individual observations, the yi's, are statistically independent of one another, then according to the theory of probability the PDF for the data y = (y1, ..., ym) given the parameter vector w can be expressed as a multiplication of PDFs for individual observations,

f(y = (y1, y2, ..., ym) | w) = f1(y1 | w) f2(y2 | w) ... fm(ym | w).   (1)

To illustrate the idea of a PDF, consider the simplest case with one observation and one parameter, that is, m = k = 1. Suppose that the data y represents the number of successes in a sequence of 10 Bernoulli trials (e.g. tossing a coin 10 times) and that the probability of a success on any one trial, represented by the parameter w, is 0.2. The PDF in this case is given by

f(y | n = 10, w = 0.2) = [10! / (y! (10 - y)!)] (0.2)^y (0.8)^(10-y),   y = 0, 1, ..., 10,   (2)

which is known as the binomial distribution with parameters n = 10 and w = 0.2.
Fig. 1. Binomial probability distributions of sample size n = 10 and probability parameter w = 0.2 (top) and w = 0.7 (bottom).

Note that the number of trials (n) is considered as a parameter. The shape of this PDF is shown in the top panel of Fig. 1. If the parameter value is changed to, say, w = 0.7, a new PDF is obtained as

f(y | n = 10, w = 0.7) = [10! / (y! (10 - y)!)] (0.7)^y (0.3)^(10-y),   y = 0, 1, ..., 10,   (3)

whose shape is shown in the bottom panel of Fig. 1. The following is the general expression of the PDF of the binomial distribution for arbitrary values of w and n:

f(y | n, w) = [n! / (y! (n - y)!)] w^y (1 - w)^(n-y),   0 ≤ w ≤ 1;  y = 0, 1, ..., n,   (4)

which as a function of y specifies the probability of the data y for a given value of n and w. The collection of all such PDFs generated by varying the parameter across its range (0–1 in this case for w; n ≥ 1) defines a model.
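As a concrete illustration of Eq. (4), the following minimal sketch, written in the same base-Matlab style as the appendix code but not part of it, evaluates the binomial PDF over y = 0, ..., 10 for the two parameter values plotted in Fig. 1; the entries for w = 0.2 include the values 0.302 at y = 2 and 0.026 at y = 5 cited in Section 2.2 below.

n = 10;
y = (0:n)';                                   % all possible data values
binom = arrayfun(@(k) nchoosek(n, k), y);     % binomial coefficients n!/(y!(n-y)!)
pdf_w02 = binom .* 0.2.^y .* 0.8.^(n - y);    % Eq. (2): PDF for w = 0.2 (top panel of Fig. 1)
pdf_w07 = binom .* 0.7.^y .* 0.3.^(n - y);    % Eq. (3): PDF for w = 0.7 (bottom panel of Fig. 1)
disp([y pdf_w02 pdf_w07]);                    % e.g. pdf_w02(3) = 0.302 (y = 2), pdf_w02(6) = 0.026 (y = 5)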
2.2. Likelihood function

Given a set of parameter values, the corresponding PDF will show that some data are more probable than other data. In the previous example, under the PDF with w = 0.2, the observation y = 2 is more likely to occur than y = 5 (0.302 vs. 0.026). In reality, however, we have already observed the data. Accordingly, we are faced with an inverse problem: Given the observed data and a model of interest, find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector y and the parameter vector w in f(y | w), i.e.

L(w | y) = f(y | w).   (5)

Thus L(w | y) represents the likelihood of the parameter w given the observed data y, and as such is a function of w. For the one-parameter binomial example in Eq. (4), the likelihood function for y = 7 and n = 10 is given by

L(w | n = 10, y = 7) = f(y = 7 | n = 10, w) = [10! / (7! 3!)] w^7 (1 - w)^3,   0 ≤ w ≤ 1.   (6)

The shape of this likelihood function is shown in Fig. 2. There exists an important difference between the PDF f(y | w) and the likelihood function L(w | y). As illustrated in Figs. 1 and 2, the two functions are defined on different axes, and therefore are not directly comparable to each other. Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood (unnormalized probability) of a particular parameter value for a fixed data set. Note that the likelihood function in Fig. 2 is a curve because there is only one parameter beside n, which is assumed to be known. If the model has two parameters, the likelihood function will be a surface sitting above the parameter space. In general, for a model with k parameters, the likelihood function L(w | y) takes the shape of a k-dimensional geometrical surface sitting above a k-dimensional hyperplane spanned by the parameter vector w = (w1, ..., wk).
Fig. 2. The likelihood function given observed data y = 7 and sample size n = 10 for the one-parameter model described in the text.
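A brief sketch of how Fig. 2 can be reproduced numerically (again in base Matlab, and not part of the original appendix code): evaluate Eq. (6) on a fine grid of w values and locate the maximum, which anticipates the MLE estimate wMLE = 0.7 derived in Section 3.

w = (0:0.001:1)';                            % grid of parameter values on [0, 1]
L = nchoosek(10, 7) * w.^7 .* (1 - w).^3;    % Eq. (6): L(w | n = 10, y = 7)
[Lmax, idx] = max(L);
disp([w(idx) Lmax]);                         % approximately 0.7 and 0.267 (see Section 3)
plot(w, L); xlabel('w'); ylabel('L(w | n = 10, y = 7)');   % reproduces the curve in Fig. 2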

3. Maximum likelihood estimation

Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Fig. 1), we are interested in finding the parameter value that corresponds to the desired probability distribution.

The principle of maximum likelihood estimation (MLE), originally developed by R. A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data most likely, which means that one must seek the value of the parameter vector that maximizes the likelihood function L(w | y). The resulting parameter vector, which is sought by searching the multi-dimensional parameter space, is called the MLE estimate, and is denoted by wMLE = (w1,MLE, ..., wk,MLE). For example, in Fig. 2 the MLE estimate is wMLE = 0.7, for which the maximized likelihood value is L(wMLE = 0.7 | n = 10, y = 7) = 0.267. The probability distribution corresponding to this MLE estimate is shown in the bottom panel of Fig. 1. According to the MLE principle, this is the population that is most likely to have generated the observed data of y = 7. To summarize, maximum likelihood estimation is a method that seeks the probability distribution that makes the observed data most likely.

3.1. Likelihood equation

MLE estimates need not exist nor be unique. In this section, we show how to compute MLE estimates when they exist and are unique. For computational convenience, the MLE estimate is obtained by maximizing the log-likelihood function, ln L(w | y). This is because the two functions, ln L(w | y) and L(w | y), are monotonically related to each other, so the same MLE estimate is obtained by maximizing either one. Assuming that the log-likelihood function, ln L(w | y), is differentiable, if wMLE exists, it must satisfy the following partial differential equation known as the likelihood equation:

∂ ln L(w | y) / ∂wi = 0   (7)

at wi = wi,MLE for all i = 1, ..., k. This is because the definition of a maximum or minimum of a continuously differentiable function implies that its first derivatives vanish at such points.

The likelihood equation represents a necessary condition for the existence of an MLE estimate. An additional condition must also be satisfied to ensure that ln L(w | y) is a maximum and not a minimum, since the first derivative alone cannot reveal this. To be a maximum, the shape of the log-likelihood function should be concave (it must represent a peak, not a valley) in the neighborhood of wMLE. This can be checked by calculating the second derivatives of the log-likelihood and showing whether they are all negative at wi = wi,MLE for i = 1, ..., k (see footnote 1):

∂^2 ln L(w | y) / ∂wi^2 < 0.   (8)

Footnote 1: More precisely, consider the Hessian matrix H(w) defined by Hij(w) = ∂^2 ln L(w) / (∂wi ∂wj), i, j = 1, ..., k. The full second-order condition requires that H(w) be negative definite at the solution, that is, z' H(w = wMLE) z < 0 for any k × 1 real-valued vector z, where z' denotes the transpose of z.

To illustrate the MLE procedure, let us again consider the previous one-parameter binomial example given a fixed value of n. First, by taking the logarithm of the likelihood function L(w | n = 10, y = 7) in Eq. (6), we obtain the log-likelihood as

ln L(w | n = 10, y = 7) = ln [10! / (7! 3!)] + 7 ln w + 3 ln(1 - w).   (9)

Next, the first derivative of the log-likelihood is calculated as

d ln L(w | n = 10, y = 7) / dw = 7/w - 3/(1 - w) = (7 - 10w) / (w(1 - w)).   (10)

By requiring this equation to be zero, the desired MLE estimate is obtained as wMLE = 0.7. To make sure that the solution represents a maximum, not a minimum, the second derivative of the log-likelihood is calculated and evaluated at w = wMLE,

d^2 ln L(w | n = 10, y = 7) / dw^2 = -7/w^2 - 3/(1 - w)^2 = -47.62 < 0,   (11)

which is negative, as desired.
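The analytic solution above can be cross-checked numerically, which also previews the optimization-based approach discussed next. The following sketch (base Matlab, not part of the appendix code) minimizes the minus log-likelihood of Eq. (9) with fminbnd; the general binomial solution, y successes in n trials giving wMLE = y/n, is noted in the comments.

negloglik = @(w) -(log(nchoosek(10, 7)) + 7*log(w) + 3*log(1 - w));   % minus Eq. (9)
wmle = fminbnd(negloglik, 1e-6, 1 - 1e-6);   % numerical MLE estimate, approximately 0.7
% In general, for y successes in n Bernoulli trials the likelihood equation
% gives wMLE = y/n, here 7/10 = 0.7, in agreement with Eq. (10).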
In practice, however, it is usually not possible to obtain an analytic form solution for the MLE estimate, especially when the model involves many parameters and its PDF is highly non-linear. In such situations, the MLE estimate must be sought numerically using non-linear optimization algorithms. The basic idea of non-linear optimization is to quickly find the optimal parameters that maximize the log-likelihood.

Fig. 3. A schematic plot of the log-likelihood function for a fictitious one-parameter model. Point B is the global maximum whereas points A and C are two local maxima. The series of arrows depicts an iterative optimization process.

This is done by searching much smaller sub-sets of the multi-dimensional parameter space rather than exhaustively searching the whole parameter space, which becomes intractable as the number of parameters increases. The intelligent search proceeds by trial and error over the course of a series of iterative steps. Specifically, on each iteration, by taking into account the results from the previous iteration, a new set of parameter values is obtained by adding small changes to the previous parameters in such a way that the new parameters are likely to lead to improved performance. Different optimization algorithms differ in how this updating routine is conducted. The iterative process, as shown by a series of arrows in Fig. 3, continues until the parameters are judged to have converged on the optimal set of parameters (i.e., point B in Fig. 3) according to an appropriately predefined criterion. Examples of the stopping criterion include the maximum number of iterations allowed or the minimum amount of change in parameter values between two successive iterations.
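To make the idea of iterative updating concrete, here is a deliberately simple sketch (not one of the algorithms used in the appendix) that applies gradient-ascent steps to the one-parameter binomial example, adding to the current parameter a small change proportional to the first derivative in Eq. (10); the step size of 0.01 is an assumed value chosen for illustration.

w = 0.3;                                  % initial parameter guess
step = 0.01;                              % assumed step size
for iter = 1:500
    grad = 7/w - 3/(1 - w);               % d ln L / dw from Eq. (10)
    w_new = min(max(w + step*grad, 1e-6), 1 - 1e-6);   % small update, kept inside (0, 1)
    if abs(w_new - w) < 1e-8              % stopping criterion: negligible parameter change
        break;
    end
    w = w_new;
end
disp(w);                                  % converges to about 0.7, the MLE estimate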
3.2. Local maxima

It is worth noting that the optimization algorithm does not necessarily guarantee that a set of parameter values that uniquely maximizes the log-likelihood will be found. Finding optimum parameters is essentially a heuristic process in which the optimization algorithm tries to improve upon an initial set of parameters that is supplied by the user. Initial parameter values are chosen either at random or by guessing. Depending upon the choice of the initial parameter values, the algorithm could prematurely stop and return a sub-optimal set of parameter values. This is called the local maxima problem. As an example, in Fig. 3 note that although the starting parameter value at point a2 will lead to the optimal point B, called the global maximum, the starting parameter value at point a1 will lead to point A, which is a sub-optimal solution. Similarly, the starting parameter value at a3 will lead to another sub-optimal solution at point C.

Unfortunately, there exists no general solution to the local maxima problem. Instead, a variety of techniques have been developed in an attempt to avoid the problem, though there is no guarantee of their effectiveness. For example, one may choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly, as sketched below. When that happens, one can conclude with some confidence that a global maximum has been found (see footnote 2).

Footnote 2: A stochastic optimization algorithm known as simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) can overcome the local maxima problem, at least in theory, though the algorithm may not be a feasible option in practice as it may take an unrealistically long time to find the solution.
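A minimal multi-start sketch of that strategy is given below. It assumes the variables and the power_mle function defined in the appendix code (n, t, x, opts, and the parameter bounds); it simply reruns the optimizer from several random starting points and keeps the solution with the smallest minus log-likelihood.

best_negloglik = Inf;                         % best (smallest) minus log-likelihood so far
best_w = [];
for run = 1:10                                % number of random restarts (assumed value)
    init_w = rand(2, 1);                      % random starting parameter values
    [w_run, negloglik_run] = fmincon(@power_mle, init_w, [], [], [], [], ...
        zeros(2, 1), 100*ones(2, 1), [], opts);
    if negloglik_run < best_negloglik         % keep the best solution across restarts
        best_negloglik = negloglik_run;
        best_w = w_run;
    end
end
disp([best_w' best_negloglik]);               % candidate global maximum of the log-likelihood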

3.3. Relation to least-squares estimation

Recall that in MLE we seek the parameter values that are most likely to have produced the data. In LSE, on the other hand, we seek the parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the square-loss function. Formally, in LSE, the sum of squares error (SSE) between observations and predictions is minimized:

SSE(w) = Σ_{i=1}^m (yi - prdi(w))^2,   (12)

where prdi(w) denotes the model's prediction for the ith observation. Note that SSE(w) is a function of the parameter vector w = (w1, ..., wk).

As in MLE, finding the parameter values that minimize SSE generally requires use of a non-linear optimization algorithm. Minimization of SSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters. The choice between the two methods of estimation can have non-trivial consequences. In general, LSE estimates tend to differ from MLE estimates, especially for data that are not normally distributed such as proportion correct and response time. An implication is that one might possibly arrive at different conclusions about the same data set depending upon which method of estimation is employed in analyzing the data. When this occurs, MLE should be preferred to LSE, unless the probability density function is unknown or difficult to obtain in an easily computable form, for instance, for the diffusion model of recognition memory (Ratcliff, 1978; see footnote 3).

Footnote 3: For this model, the PDF is expressed as an infinite sum of transcendental functions.

There is a situation, however, in which the two methods intersect. This is when observations are independent of one another and are normally distributed with a constant variance. In this case, maximization of the log-likelihood is equivalent to minimization of SSE, and therefore the same parameter values are obtained under either MLE or LSE.
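The equivalence claimed in the last sentence can be made explicit with a short derivation; this is a standard result restated here for completeness rather than a step taken from the original text. If each observation yi is normally distributed about the model prediction prdi(w) with a common variance σ^2, the log-likelihood of the m independent observations is

ln L(w | y) = -(m/2) ln(2πσ^2) - SSE(w) / (2σ^2),

so, for any fixed σ, maximizing ln L(w | y) over w is exactly the same as minimizing SSE(w), and the MLE and LSE parameter estimates coincide.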
4. Illustrative example

In this section, I present an application example of maximum likelihood estimation. To illustrate the method, I chose forgetting data given the recent surge of interest in this topic (e.g. Rubin & Wenzel, 1996; Wickens, 1998; Wixted & Ebbesen, 1991).

Among a half-dozen retention functions that have been proposed and tested in the past, I provide an example of MLE for the two functions, power and exponential. Let w = (w1, w2) be the parameter vector, t time, and p(w, t) the model's prediction of the probability of correct recall at time t. The two models are defined as

power model:        p(w, t) = w1 t^(-w2)       (w1, w2 > 0),
exponential model:  p(w, t) = w1 exp(-w2 t)    (w1, w2 > 0).   (13)

Suppose that the data y = (y1, ..., ym) consists of m observations in which yi (0 ≤ yi ≤ 1) represents an observed proportion of correct recall at time ti (i = 1, ..., m). We are interested in testing the viability of these models. We do this by fitting each to the observed data and examining its goodness of fit.

Application of MLE requires specification of the PDF f(y | w) of the data under each model. To do this, first we note that each observed proportion yi is obtained by dividing the number of correct responses xi by the total number of independent trials n, yi = xi / n (0 ≤ yi ≤ 1). We then note that each xi is binomially distributed with probability p(w, ti), so that the PDFs for the power model and the exponential model are obtained as

power:        f(xi | n, w) = [n! / ((n - xi)! xi!)] (w1 ti^(-w2))^xi (1 - w1 ti^(-w2))^(n-xi),
exponential:  f(xi | n, w) = [n! / ((n - xi)! xi!)] (w1 exp(-w2 ti))^xi (1 - w1 exp(-w2 ti))^(n-xi),   (14)

where xi = 0, 1, ..., n and i = 1, ..., m.

There are two points to be made regarding the PDFs in the above equation. First, the probability parameter of a binomial probability distribution (i.e. w in Eq. (4)) is being modeled. Therefore, the PDF for each model in Eq. (14) is obtained by simply replacing the probability parameter w in Eq. (4) with the model equation, p(w, t), in Eq. (13). Second, note that yi is related to xi by a fixed scaling constant, 1/n. As such, any statistical conclusion regarding xi is directly applicable to yi, except for the scale transformation. In particular, the PDF for yi, f(yi | n, w), is obtained by simply replacing xi in f(xi | n, w) with n yi.

Now, assuming that the xi's are statistically independent of one another, the desired log-likelihood function for the power model is given by

ln L(w = (w1, w2) | n, x) = ln [ f(x1 | n, w) f(x2 | n, w) ... f(xm | n, w) ]
                          = Σ_{i=1}^m ln f(xi | n, w)
                          = Σ_{i=1}^m [ xi ln(w1 ti^(-w2)) + (n - xi) ln(1 - w1 ti^(-w2))
                                        + ln n! - ln(n - xi)! - ln xi! ].   (15)

Fig. 4. Modeling forgetting data. Squares represent the data in Murdock (1961). The thick (respectively, thin) curves are best fits by the power (respectively, exponential) models.

Table 1
Summary fits of Murdock (1961) data for the power and exponential models under the maximum likelihood estimation (MLE) method and the least-squares estimation (LSE) method

                        MLE                                   LSE
                        Power             Exponential         Power             Exponential

Loglik/SSE (r2)         -313.37 (0.886)   -305.31 (0.963)     0.0540 (0.894)    0.0169 (0.967)
Parameter w1             0.953             1.070               1.003             1.092
Parameter w2             0.498             0.131               0.511             0.141

Note: For each model fitted, the first row shows the maximized log-likelihood value for MLE and the minimized sum of squares error value for LSE. Each number in parentheses is the proportion of variance accounted for (i.e. r2) in that case. The second and third rows show the MLE and LSE parameter estimates for each of w1 and w2. The above results were obtained using the Matlab code described in the appendix.

This quantity is to be maximized with respect to the two parameters, w1 and w2. It is worth noting that the last three terms of the final expression in the above equation (i.e., ln n! - ln(n - xi)! - ln xi!) do not depend upon the parameter vector and thereby do not affect the MLE results. Accordingly, these terms can be ignored, and their values are often omitted in the calculation of the log-likelihood. Similarly, for the exponential model, its log-likelihood function can be obtained from Eq. (15) by substituting w1 exp(-w2 ti) for w1 ti^(-w2).

In illustrating MLE, I used a data set from Murdock (1961). In this experiment subjects were presented with a set of words or letters and were asked to recall the items after six different retention intervals, (t1, ..., t6) = (1, 3, 6, 9, 12, 18) in seconds, and thus m = 6. The proportion recall at each retention interval was calculated based on 100 independent trials (i.e. n = 100) to yield the observed data (y1, ..., y6) = (0.94, 0.77, 0.40, 0.26, 0.24, 0.16), from which the number of correct responses, xi, is obtained as 100 yi, i = 1, ..., 6. In Fig. 4, the proportion recall data are shown as squares. The curves in Fig. 4 are best fits obtained under MLE. Table 1 summarizes the MLE results, including fit measures and parameter estimates, and also includes the LSE results for comparison. The Matlab code used for the calculations is included in the appendix.
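As a quick, self-contained check of Eq. (15) against Table 1 (this snippet is separate from the appendix code), one can evaluate the power-model log-likelihood, without the constant ln n! - ln(n - xi)! - ln xi! terms, at the MLE estimates reported in Table 1; the result agrees with the tabled value up to rounding.

n = 100;
t = [1 3 6 9 12 18]';                        % retention intervals (s)
y = [.94 .77 .40 .26 .24 .16]';              % observed proportions correct (Murdock, 1961)
x = n*y;                                     % numbers of correct responses
w = [0.953; 0.498];                          % power-model MLE estimates from Table 1
p = w(1)*t.^(-w(2));                         % power-model predictions p(w, t)
loglik = sum(x.*log(p) + (n - x).*log(1 - p));   % approximately -313.4, cf. Table 1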
The results in Table 1 indicate that under either method of estimation, the exponential model fit better than the power model. That is, for the former, the log-likelihood was larger and the SSE smaller than for the latter. The same conclusion can be drawn even in terms of r2. Also note the appreciable discrepancies in parameter estimates between MLE and LSE. These differences are not unexpected and are due to the fact that the proportion data are binomially distributed, not normally distributed.
Further, the constant variance assumption required for the equivalence between MLE and LSE does not hold for binomial data, for which the variance, σ^2 = np(1 - p), depends upon the proportion correct p.

4.1. MLE interpretation

What does it mean when one model fits the data better than does a competitor model? It is important not to jump to the conclusion that the former model does a better job of capturing the underlying process and therefore represents a closer approximation to the true model that generated the data. A good fit is a necessary, but not a sufficient, condition for such a conclusion. A superior fit (i.e., a higher value of the maximized log-likelihood) merely puts the model in a list of candidate models for further consideration. This is because a model can achieve a superior fit to its competitors for reasons that have nothing to do with the model's fidelity to the underlying process. For example, it is well established in statistics that a complex model with many parameters fits data better than a simple model with few parameters, even if it is the latter that generated the data. The central question is then how one should decide among a set of competing models. A short answer is that a model should be selected based on its generalizability, which is defined as a model's ability not only to fit the current data but also to predict future data. For a thorough treatment of this and related issues in model selection, the reader is referred elsewhere (e.g. Linhart & Zucchini, 1986; Myung, Forster, & Browne, 2000; Pitt, Myung, & Zhang, 2002).

5. Concluding remarks

This article provides a tutorial exposition of maximum likelihood estimation. MLE is of fundamental importance in the theory of inference and is a basis of many inferential techniques in statistics, unlike LSE, which is primarily a descriptive tool. In this paper, I provide a simple, intuitive explanation of the method so that the reader can have a grasp of some of the basic principles. I hope the reader will apply the method in his or her mathematical modeling efforts so that a plethora of widely available MLE-based analyses (e.g. Batchelder & Crowther, 1997; Van Zandt, 2000) can be performed on data, thereby extracting as much information and insight as possible into the underlying mental process under investigation.

Acknowledgments

This work was supported by research Grant R01 MH57472 from the National Institute of Mental Health. The author thanks Mark Pitt, Richard Schweickert, and two anonymous reviewers for valuable comments on earlier versions of this paper.

Appendix
This appendix presents Matlab code that performs MLE and LSE analyses for the example described in the text.

Matlab Code for MLE


% This is the main program that finds MLE estimates. Given a model, it
% takes sample size (n), time intervals (t) and observed proportion correct
% (y) as inputs. It returns the parameter values that maximize the log-
% likelihood function
global n t x; % define global variables
opts = optimset('DerivativeCheck','off','Display','off','TolX',1e-6,'TolFun',1e-6, ...
    'Diagnostics','off','MaxIter',200,'LargeScale','off');
% option settings for the optimization algorithm
n = 100;                          % number of independent Bernoulli trials (i.e., sample size)
t = [1 3 6 9 12 18]';             % time intervals as a column vector
y = [.94 .77 .40 .26 .24 .16]';   % observed proportion correct as a column vector
x = n*y;                          % number of correct responses

init_w = rand(2,1);               % starting parameter values
low_w = zeros(2,1);               % parameter lower bounds
up_w = 100*ones(2,1);             % parameter upper bounds

while 1
    [w1,lik1,exit1] = fmincon(@power_mle,init_w,[],[],[],[],low_w,up_w,[],opts);
    % optimization for the power model that minimizes the minus log-likelihood
    % (note that minimization of the minus log-likelihood is equivalent to
    % maximization of the log-likelihood)
    % w1:    MLE parameter estimates
    % lik1:  minimized minus log-likelihood (i.e., -1 times the maximized log-likelihood)
    % exit1: optimization has converged if exit1 > 0, or not otherwise
    [w2,lik2,exit2] = fmincon(@expo_mle,init_w,[],[],[],[],low_w,up_w,[],opts);
    % optimization for the exponential model that minimizes the minus log-likelihood
    prd1 = w1(1,1)*t.^(-w1(2,1));                         % best fit prediction by the power model
    r2(1,1) = 1 - sum((prd1-y).^2)/sum((y-mean(y)).^2);   % r^2 for the power model
    prd2 = w2(1,1)*exp(-w2(2,1)*t);                       % best fit prediction by the exponential model
    r2(2,1) = 1 - sum((prd2-y).^2)/sum((y-mean(y)).^2);   % r^2 for the exponential model

    if sum(r2 > 0) == 2
        break;
    else
        init_w = rand(2,1);       % try new random starting values
    end;
end;

format long;
disp(num2str([w1 w2 r2],5));               % display parameter estimates and r^2 values
disp(num2str([lik1 lik2 exit1 exit2],5));  % display minus log-likelihoods and exit flags
% end of the main program

function loglik = power_mle(w)
% POWER_MLE The minus log-likelihood function of the power model
global n t x;
p = w(1,1)*t.^(-w(2,1));                                  % power model prediction given parameter w
p = p + (p == zeros(6,1))*1e-5 - (p == ones(6,1))*1e-5;   % ensure 0 < p < 1
loglik = (-1)*(x.*log(p) + (n-x).*log(1-p));
% minus log-likelihood for individual observations
loglik = sum(loglik);             % overall minus log-likelihood being minimized

function loglik = expo_mle(w)
% EXPO_MLE The minus log-likelihood function of the exponential model
global n t x;
p = w(1,1)*exp(-w(2,1)*t);                                % exponential model prediction
p = p + (p == zeros(6,1))*1e-5 - (p == ones(6,1))*1e-5;   % ensure 0 < p < 1
loglik = (-1)*(x.*log(p) + (n-x).*log(1-p));
% minus log-likelihood for individual observations
loglik = sum(loglik);             % overall minus log-likelihood being minimized

Matlab Code for LSE


% This is the main program that finds LSE estimates. Given a model, it
% takes sample size (n), time intervals (t) and observed proportion correct
% (y) as inputs. It returns the parameter values that minimize the sum of
% squares error
global t; % define global variable
opts = optimset('DerivativeCheck','off','Display','off','TolX',1e-6,'TolFun',1e-6, ...
    'Diagnostics','off','MaxIter',200,'LargeScale','off');
% option settings for the optimization algorithm
n = 100;                          % number of independent binomial trials (i.e., sample size)
t = [1 3 6 9 12 18]';             % time intervals as a column vector
y = [.94 .77 .40 .26 .24 .16]';   % observed proportion correct as a column vector

init_w = rand(2,1);               % starting parameter values
low_w = zeros(2,1);               % parameter lower bounds
up_w = 100*ones(2,1);             % parameter upper bounds

[w1,sse1,res1,exit1] = lsqnonlin(@(w) power_lse(w,y),init_w,low_w,up_w,opts);
% optimization for the power model
% w1:    LSE parameter estimates
% sse1:  minimized SSE value
% res1:  value of the residual at the solution
% exit1: optimization has converged if exit1 > 0, or not otherwise
[w2,sse2,res2,exit2] = lsqnonlin(@(w) expo_lse(w,y),init_w,low_w,up_w,opts);
% optimization for the exponential model

r2(1,1) = 1 - sse1/sum((y-mean(y)).^2);   % r^2 for the power model
r2(2,1) = 1 - sse2/sum((y-mean(y)).^2);   % r^2 for the exponential model

format long;
disp(num2str([w1 w2 r2],5));               % display parameter estimates and r^2 values
disp(num2str([sse1 sse2 exit1 exit2],5));  % display SSE values and exit flags
% end of the main program

function dev = power_lse(w,y)
% POWER_LSE The deviation between observation and prediction of the power model
global t;
p = w(1,1)*t.^(-w(2,1));          % power model prediction
dev = p - y;
% deviation between prediction and observation, the square of which is being minimized

function dev = expo_lse(w,y)
% EXPO_LSE The deviation between observation and prediction of the exponential model
global t;
p = w(1,1)*exp(-w(2,1)*t);        % exponential model prediction
dev = p - y;
% deviation between prediction and observation, the square of which is being minimized

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov, & F. Caski (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.
Batchelder, W. H., & Crowther, C. S. (1997). Multinomial processing tree models of factorial categorization. Journal of Mathematical Psychology, 41, 45–55.
Bickel, P. J., & Doksum, K. A. (1977). Mathematical statistics. Oakland, CA: Holden-Day.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury.
DeGroot, M. H., & Schervish, M. J. (2002). Probability and statistics (3rd ed.). Boston, MA: Addison-Wesley.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Lamberts, K. (2000). Information-accumulation theory of speeded categorization. Psychological Review, 107(2), 227–260.
Linhart, H., & Zucchini, W. (1986). Model selection. New York, NY: Wiley.
Murdock, B. B., Jr. (1961). The retention of individual items. Journal of Experimental Psychology, 62, 618–625.
Myung, I. J., Forster, M., & Browne, M. W. (2000). Special issue on model selection. Journal of Mathematical Psychology, 44, 1–2.
Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.
Rubin, D. C., Hinton, S., & Wenzel, A. (1999). The precise time course of retention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1161–1176.
Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734–760.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Spanos, A. (1999). Probability theory and statistical inference. Cambridge, UK: Cambridge University Press.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550–592.
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin & Review, 7(3), 424–465.
Wickens, T. D. (1998). On the form of the retention function: Comment on Rubin and Wenzel (1996): A quantitative description of retention. Psychological Review, 105, 379–386.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2, 409–415.
