This action might not be possible to undo. Are you sure you want to continue?

# Monte Carlo Statistical Methods

Christian P. Robert CREST, Insee, Paris George Casella Cornell University, Ithaca, NY Draft Version 1.1 February 27, 1998

CHAPTER 1

Introduction

Version 1.1 February 27, 1998 Until the advent of powerful and accessible computing methods, the experimenter was confronted with a di cult choice. Either describe an accurate model of a phenomenon, which would usually prevent the computation of explicit answers, or choose a standard model which would allow this computation, but would often not be a close representation of a realistic model. This dilemma is present in many branches of statistical applications, for example in electrical engineering, aeronautics, biology, networks, and astronomy. To use realistic models, the researchers in these disciplines have often developed original approaches for model tting that are customized for their own problems. (This is particularly true of physicists, the originators of Markov chain Monte Carlo methods.) Traditional methods of analysis, such as the usual numerical analysis techniques, turn out to be not well adapted for such settings. The rst section of this chapter presents a number of examples of statistical models, some of which were instrumental in developing the eld of simulation-based inference. The remaining sections describe the di culties speci c to most common statistical methods, while the nal section contains a comparison with numerical analysis techniques.

1.1 Statistical Models

In a purely statistical setup, computational di culties occur at both the level of probabilistic modeling of the inferred phenomenon and at the level of statistical inference on this model (estimation, prediction, tests, variable selection, etc.). In the rst case, a detailed representation of the causes of the phenomenon, such as accounting for potential explanatory variables linked to the phenomenon, can lead to a probabilistic structure which is too complex to allow for a parametric representation of the model. Moreover, there may be no provision for getting closed-form estimates of quantities of interest. A frequent setup with this type of complexity is an expert systems (in medicine, physics, nance, etc.) or more generally a graph structure. Figure 1.1.1 gives an example of such a structure analyzed in Spiegelhalter et al. (1993). It is related to the detection of a left ventricle hypertrophia (LVH), where the links between causes represent probabilistic dependen-

4

INTRODUCTION

1.1.1

n1: Birth asphyxia?

n2: Disease?

n3: Age at presentation?

n4: LVH?

n5: Duct flow?

n6: Cardiac mixing?

n7: Lung parenchyma?

n8: Lung flow?

n9: Sick?

n10: Hypoxia distribution?

n11: Hypoxia in O2?

n12: CO2?

n13: Chest X-ray?

n14: Grunting?

n15: LVH report?

n16: Lower body O2?

n17: Right up. quad. O2?

n18: CO2 report?

n19: X-ray report?

n20: Grunting report?

Figure 1.1.1. Probabilistic representation of links between causes of left ventricle hypertrophia (Source: Spiegelhalter et al., 1993)

cies. (The motivation behind the analysis is to improve the prediction of this disease.) In this case, the conditional distributions of nodes with respect to their parents lead to the joint distribution. See Robert1 (1991) or Lauritzen (1993) for other examples of complex expert systems where this reconstitution is impossible. A second setup where model complexity prohibits an explicit representation appears in econometrics (and in many other areas) for structures of latent (or missing) variable models. Given a \simple" model, aggregation or removal of some components of this model may sometimes induce such involved structures that simulation is truly the only way to draw an inference. (Chapter 9 provides a series of examples of such models where simulation methods are necessary.) Example 1.1.1 {Censored data models{ Censored data models can be considered to be missing data models where densities are not sampled directly. To obtain estimates, and make inferences, usually requires programming or computing time and precludes analytical answers. Barring cases where the censoring phenomenon can be ignored (see Chapter 9), several types of censoring can be categorized by their relation with an underlying (unobserved) model, Yi f (yi j ): (i) Given random variables Yi , which may be times of observation or concentrations, the actual observations are Yi = minfYi ; ug where u is the maximal observation duration or the smallest measurable concentration

1 Claudine, not Christian!

There. (iii) The variables Yi are associated with auxiliary variables Xi g such that yi = h(yi . Similarly. the observation of the censored variable Z = X ^ !.2) f(z) = z e? z I z ! + Z1 ! x e? x dx ! (z) . the additive form of a density.1. where ! is constant. k In some cases. namely the variable I yi >xi . if X has a Weibull distribution with two parameters.) Example 1. in a longitudinal study of a disease. the variable Z = z? ?1 ' z ? where ' is the density of the normal N (0. (Here.1.1. 2 ). while formally explicit. may be either known or unknown. The distributions (1. where a ( ) is the Dirac mass at a.1. Y ) is distributed as _ ?1 ' z? 1? z ? (1. prohibits the computation of the density of a sample (X1 . (ii) The original variables Yi are kept in the sample with probability (yi ) and the number of discarded variables is either known or unknown. if X N ( . In this case.2 {Mixture models{ Models of mixtures of distributions are based on the assumption that the observations are generated from one of k elementary distributions fi with probability pi . Similarly.1 ] STATISTICAL MODELS 5 rate. which is not easy to compute. P(X !). Xn) for n large. 1) distribution and is the corresponding cdf. the weight of the Dirac mass. xi) is the observation.1.1. Typically. The fact that truncation occurred. the overall density being p1f1 (x) + + pk fk (x) : . 2 ) and Y X ^ Y = min(X. some patients may leave the study either due to other death causes or by simply dropping out. cannot be explicitly computed. If the product is still functioning at the end of the experiment.2) appear naturally in quality control applications. the observation on failure time is censored. has the density (1. . ) and density f(x) = x ?1e? x on IR+ . where the quantity of interest is time to failure. As an example. W e( .1) + 1? N ( .1) and (1. xi). \explicit" has the restrictive meaning that \it can be computed in a reasonable time". xi) = min(yi .1. h(yi . testing of a product may be of a duration !.

see Problem 1.6 n Y INTRODUCTION 1.1 An expansion of the distribution of (X1 .1. . where for i = ?q.1.3). While the computation of standard moments like the mean or the variance of these distributions is feasible in most setups (and thus the derivation of moment estimators. . we look at a particularly important example in the processing of temporal (or time series) data where the likelihood cannot be written explicitly. ?1 the perturbations "?i can be interpreted as missing data (see Chapter 9). If the sample consists of the observation (X0 . q. Xn). k ^ i "1?i ? 1 "0 . .d. Xn ).4) which hinders statistical inference in these models. the sample density is (Problem 1. ?(q ? 1). where n > q. The iterative de in (1. the j s are unknown parameters.1.3 {Moving average model{ An MA(q) model describes variables (Xt ) that can be modeled as (t = 0.1. random variables "i N (0. : : :.i.3) Xt = "t + q X j =1 j "t?j . i "?i . k Lastly. the "i 's are i.13) P Z q Y "?i x0 ? q=1 i "?i ?(n+q) i ' ' IRq (1. i=2 ::: q X ^ "n = xn ? ^ i "n?i : i=1 nition of the "i 's is a real obstacle to an explicit integration ^ . the representation of the likelihood (and therefore the analytical computation of maximum likelihood or Bayes estimates) is generally impossible for mixtures. 2) and for j = 1. ?(q ?1). n) (1. involves kn elementary terms.4) with ^ ' x1 ? 1 "o ? P xn ? q=1 i ' "0 = x 0 ? ^ "1 = x 1 ? ^ q X i=1 q X i=1 Pq i=2 i "1?i ^ i "n?i d"?1 d"?q . Example 1. fp1f1 (xi ) + + pk fk (xi )g . which is prohibitive for large samples. i=1 . . Note that for i = ?q.1.

Casella and Berger 1990 or Robert 1994 for an introduction to these techniques. whatever the statistical technique. One course would be to use models based on exponential families (1. and the inferences that can be drawn from their use.8.3.2. these latter two methods using more e ciently the information contained in the distribution of the observations. distributions was often necessitated by computational limitations. and hence circumvent the problems associated with the need for explicit or computationally simple answers. Another course was to abandon parametric representations for non-parametric approaches which are by de nition robust against modeling errors. Alternative approaches (see. the computing bottleneck created by the need for explicit solutions has led to the use of linear structures of dependence (see Gourieroux and Monfort. while the method of moments can sometimes be expressed as a derivation of a maximization problem. while the likelihood is not bounded (see Example 1.2 Statistical Inference .2 ] STATISTICAL INFERENCE 7 Before the introduction of simulation-based inference. for instance. since they are convergent in most setups. (See Lehmann 1983. it can be shown that the 1.1. In econometrics. computational difculties encountered in the modeling of a problem often forced the use of \standard models" and \standard" distributions. reduction to simple.1 below. 1994). according to the Likelihood Principle (see Robert 1994).1. the former with maximization problems|and thus to an implicit de nition of estimators as solutions of maximization problems|.4 below) and therefore there is no maximum likelihood estimator.2) (see Lehmann. perhaps non-realistic.1). that is as a di erence equation.3. Brown. Note however that such an interpretation is rare and also that the method of moments is generally sub-optimal when compared with Bayesian or maximum likelihood approaches. in the case of normal mixtures. For instance. But the moment estimators are still of interest as starting values for iterative methods aiming at maximizing the likelihood. In their implementation. Gourieroux and Monfort 1996) involve solving implicit equations for methods of moments or minimization of generalized distances (for M-estimators). which enjoy numerous regularity properties (see Note 1. 1986 or Robert. these approaches are customarily associated with speci c mathematical computations. 1989. 1983. but it is also the case that the reduction to simple distributions does not necessarily eliminate the issue of non-explicit expressions. Berger 1985. Our major focus is the application of simulation-based techniques to provide solutions and inference for a more realistic set of models.) As previously mentioned. The statistical techniques that we will be most concerned with are maximum likelihood and Bayesian methods. Approaches by minimal distance can in general be reformulated as maximizations of formal likelihoods as illustrated in Example 1. the later with integration problems|and thus to a (formally) explicit representation of estimators as an integral. 1995).

minimization of (1. The parameter (a. k Although somewhat obvious. Estimation of the ij 's by minimizing the sum of the (yij ? ij )2 's is possible through the (numerical) algorithm called \pool-adjacent-violators" and developed by Robertson et al. and the variables "i represent errors. where the means are increasing in i and j.8 INTRODUCTION 1. b) is estimated by minimizing the distance (1.2.4).1. . yielding the least squares estimates. In the particular case of linear regression we observe (xi . 2).2. i = 1. For example. Yi). . i = 1. or equivalently that the linear relationship IE Y jx] = ax+b holds.2. the log-likelihood function for (a.1) Yi = a + bxi + "i . In this latter case the additional estimator of 2 is consistent if the normal approximation is asymptotically valid. i=1 1.2. b) is proportional to n X log( ?n ) ? (yi ? axi ? b)2 =2 2.2) n X i=1 (yi ? axi ? b)2 in (a.2. in (1.3 Likelihood Methods The method of maximumlikelihood estimation is quite a popular technique for deriving estimators. the likelihood structure also provides an estimator of 2 . n. in particular that "i N (0.1 {Least Squares Estimators{ Estimation by least squares can be traced back to Gauss (1810) and Legendre (1805) (see Stigler 1985). from a computational point of view. if. Yi jxi N (axi +b. : : :. and it follows that the maximum likelihood estimates of a and b are identical to the least squares estimates. Therefore. independent (equivalently.2) we assume IE("i ) = 0.) An alternative is to use an algorithm based on simulation and a representation by a normal likelihood of the problem (see x5.2) is equivalent. (1988) consider a p q table of random variables Yij with means ij . However. (See Problems 1. where (1. (1988) to solve this speci c problem. Starting from an iid sample X1 . If we add more structure to the error term. and applies in many other cases. b).18. 2)).17 and 1.3 solution of the likelihood equations which is closer to the moment estimator is a convergent estimator (see Lehmann 1983). n. in the case where the parameters are constrained Robertson et al. this formal equivalence between the optimization of a function depending on the observations and the maximization of a likelihood associated with the observations has a nontrivial outcome. to imposing a normality assumption on Y conditionally on x and applying maximum likelihood.2. Example 1. Xn from a .

) distribution is a particular case of exponential family since its density.4).12). k ). .3.2. In the context of exponential families.3) x = r f ^(x)g . there are settings where . The value of . : : :. ?( f(yj .1 {Beta MLE{ The beta Be( .. : : :. distributions with density (1. Example 1. In practice. by its construction. log(1 ? y)).1) f(x j .3. Equation (1.3) is then log y = ( ) ? ( + ).4) is then replaced . : : :. say ^. x = (log y. is known as a maximum likelihood estimator (MLE). it may still be the case that the solution of (1. in the sense that the MLE is converging almost surely to the true value of the parameter. ).3. ) = ?( ) + )) y ?1 (1 ? y) ?1 . x 2 IRk . e. : : :.3. k jx) = L( 1 .3.2) f(x) = h(x) e x? ( ) . Yn since (1.3. While it may seem absurd to estimate both parameters of the Be( . the likelihood function is L( 1 . The justi cation of the maximum likelihood method are primarily asymptotic. with = ( .2).3. xnj ) taken as a function of . ?( can be written as (1.3). Berger and Wolpert 1989). Notice that.3.3. which also is the equation yielding a method of moments estimator. ) distribution from a single observation. which is the parameter value at which L( jx) attains its maximum as a function of . that is.3 ] LIKELIHOOD METHODS 9 population with density f(xj 1 .4) log(1 ? y) = ( ) ? ( + ) . (1.3. with x held xed.1. Even if that can be done. the approach by maximum likelihood is straightforward. the log-Laplace transform. : : :. Y . k jx1. since IE X] = r ( ).1. k ): = i=1 i 1 More generaly. or there are constraints on such that the maximum of (1.3) is not explicit.2) is not a solution of (1. although it can also be interpreting as being at the fringe of the Bayesian paradigm (see. xn) Yn (1. the formal computing problem at the core of this example remains valid for a sample Y1 . or cumulant generating function of h. the range of the MLE coincides with the range of the parameter. : : :. : : :. The maximum likelihood estimator of is the solution of (1. the likelihood is de ned as the joint density f(x1 .g. 0 y 1.3. cannot be computed explicitly.3. where (z) = d log?(z)=dz denotes the digamma function (see Abramowitz and Stegun 1964).1. when the xi's are not iid. under fairly general conditions (see Lehmann and Casella 1997 and Problem 1. There is no explicit solution to (1. This last situation occurs in the estimation of the table of ij 's in the discussion following Example 1.

y>p. may be quite involved. is still well de ned.3. an observation Y = kX k2 which has a non-central chi-squared distribution. the nuisance parameters are the angles in the polar representation of (see Problem 1. . leads to a maximum likep lihood estimator of which di ers2 from Y . which has a constant bias equal to p. Many other options exist.) Example 1. Ip ) and if = k k2 is the parameter of interest. and use the resulting ^ to estimate . where is a nuisance parameter. this does not require more complex calculations although the distribution of the maximum likelihood estimator of . ^).3. such as conditional. k When we leave the exponential family setup. a typical approach is to calculate the full MLE ^ = ( ^.2 {Noncentrality Parameters{ If X Np( . 2 ( ) (see Appendix 1).14) Z I (t) = p (z=2) 1 et cos( ) sin2 ( )d ?( + 2 ) 0 1 t X (z=2)2k = 2 k=0 k!?( + k + 1) (see also Abramowitz and Stegun 1964). we face increasingly challenging di culties in using maximum likelihood techniques. If the parameter vector is of the form = ( . the maximum likelihood estimator of .5) requires us rst to evaluate the special functions Ip=2 and I(p?1)=2 (see Saxena and Alam 1982). Note also that the maximum likelihood estimator is not a solution of (1.3.2) and the maximum likelihood estimator of is ^ (x) = kxjj2.1. In principle.3. ).3 with 1 X log(1 ? y ) = ( ) ? ( + ) : i k n i When the parameter of interest is not a one-to-one function of . So even in the favorable context of exponential families we are not necessarily free from computational problems. where I is the modi ed Bessel function (see Problem 1. ^. that is when there exists nuisance parameters. 2 This phenomenon is not paradoxical as Y = kX k2 is not a su cient statistic in the original problem. One reason for this is the lack of a su cient statistic of xed dimension outside exponential 1 X log y = i n i ( ) ? ( + ). Surprisingly.5) when y < p (see Problem 1.5) y = py Ip=2 y .24). marginal. since the resolution of (1. (See Barndor -Nielsen and Cox 1994. or pro le likelihood.10 INTRODUCTION 1. since it is the solution of the implicit equation p p p I(p?1)=2 (1.

1973) Consider u1 . 1. p=2. jjxjj2=2) .49 Show that. (c) Show that the posterior distribution ( jx) cannot be written as a posterior distribution for z f (z j ).50 p that the Bayes estimator of = jj jj2 under quadratic loss for ( ) = Show 1= and x N ( . ) = ( ) and show that it only depends on z = (z2 . The prior distribution is ( . 2 ). 1] by considering the natural transformations = log( ) and = log(%=(1 ? %)). (Hint: Consider. : : : . ) = 1: 1 2 1.7 ] PROBLEMS 31 erarchical levels does not modify the conjugate nature of the resulting prior if conjugate distributions with xed scale parameters are used at every level of the hierarchy. How do you explain this phenomenon? (d) Show that the paradox does not occur when ( . whatever ( ). such that the rst of these variables has an E xp( ) distribution and the n ? other have a E xp(c ) distribution.52 (Dawid et al. ) = ( ) ?1 . Ip ) can be written as 2 (x) = 1 F1 (3=2. Deduce from the series development of 1 F1 the asymptotic development of (for jjxjj2 ! +1) and compare with 0 (x) = jjxjj2 ? p. 2 ). ) = (2jjjjjj2? ) +p and conclude. it is still impossible to derive ( jx) from f (z j ). (a) Give the shape of the posterior distribution of when ( . . for instance. (a) Show that the posterior distribution ( jx) only depends on . but that a paradox occurs. : : : . u2 N ( 2 . where c is known and takes its values in f1. 1.48 Show that a Student's t-distribution Tp ( . f (z j ). even though ( jx) only depends on z . 1 F1 (1=2. n?1g. 1. u2 . p and = ( 1 ? 2 )=( 2) is the parameter of interest. 1973) Consider n random variables x1 . the normal case. p=2.) 1. 1. (b) Show that the distribution of z .51 Assuming that ( ) = 1 is an acceptable prior for real parameters. although it only depends on z . 2. s2 2 2 = .1. show that this generalized prior leads to ( ) = 1= if 2 IR+ and to (%) = 1=%(1 ? %) if % 2 0. with zi = xi =x1 .53 (Dawid et al. s2 such that u1 N ( 1 .1. only depends on . a multiplication of the number of hi- ? z = u1 p u 2 : s 2 (b) Show that the distribution of z only depends on . apart from the trivial family F0 . . : : : . for exponential families. jjxjj =2) where 1 F1 is the con uent hypergeometric function. zn ). 2 ) does not allow for a conjugate family. xn . Study the behavior of these estimators under the weighted quadratic loss jj 2 2 L( .

) / y!(z ? y) . (b) The parameter of interest is now = 1 . (b) Show that. 2 0 with = ( . )= 1 : (a) The parameter of interest is = ( 1 . x2 j ) / t2n?1 exp ? 1 t2 + n(x1 t ? )2 + n(x2 t ? )2 dt. 0 y z. ). : : : . z j . ) only depends on and derive the distribution f (y. Give the value of p which avoids the paradox. : : : . ? )! with 0 < < 1. 1.54 (Dawid et al. 1. (b) Extend this result to overparametrized models with improper priors. ) d = ( ) and examine whether the paradox is evacuated.56 (Jaynes 1980) Consider z y (1 z?y f (y. x2n N ( 2 . (a) Show that ( jx) only depends on x1 and that f (x1j ) only depends on . with ( . 1973) Consider 2n independent random variables. : : : . 1. 2 = ) and the prior distribution is ( 1 . ). (a) Show that f (z j . 0 ). ) from f (yjz. (b) Show that. x11 . xn N ( + . 2 ). Derive the value of p which avoids the paradox. for every ( ). 1.55 (Dawid et al.1. Assume that ( jx) only depends on z and that f (z j ) only depends on . 1973) Consider (x1 . (c) Consider the previous questions when P a( .57 (Dawid et al. 2 ): . R (b) Generalize to the case where ( . 2 ). ) = ?p : Show that ( jx) only depends on z = (z1 . but that ( jx) cannot be obtained from x1 f (x1 j ). x1n N ( 1 .58 Consider x1 . ) / 1= . z j .32 INTRODUCTION 1 2 2 1. 2 . ). x2 ) with the following distribution: Z +1 h i f (x1 . 1973) Consider x = (y. Justify this distribution by considering the setting of Exercise 1. (a) Show that the paradox does not occur if ( ) is proper. z2 ) = (x1 =s. . x2 =s) and that the distribution of z only depends on . The prior distribution on is ( ) = 1. .8 (c) Show that the paradox does not occur when ( .54. (a) Show that the posterior distribution is not de ned for every n. the paradox does not occur. z ) with distribution f (xj ) and = ( . Show that ( jx) only depends on z1 and that f (z1 j ) only depends on . 1. 2 ) = ( 1 = . for any distribution ( ) such that ( jx) only depends on x1 . x21 . ( jx) cannot be proportional to ( )f (x1j ).

that is such that. for reasons related to the Pitman{ Koopman lemma (see Robert.1.8. which is equal to r ( ). + 1).1) which also allows for conjugate priors contains exponential type densities with parameter dependent support. ). are of particular interest. k ).1 Conjugate priors When prior information about the model is quite limited. Binomial B(n. ) with Beta Be( . ) e : ? ( ). if the sampling density is of the form (1. As mentioned above. 1994).8. ) is x0 and. the posterior distribution ( jx) also belongs to F . ). since the posterior distribution is ( j + x.2 Gray codes . while they can only be found in exponential families. But the main motivation for using conjugate priors is their tractability. ) = K ( . for every 2 F . These families are also called conjugate and another justi cation found in Diaconis and Ylvisaker (1979) is that some Bayes estimators are then linear.8. k ) with Dirichlet D( 1. the prior mean of ( ) for the prior ( j . In fact. if x1 . 1986). like the uniform or the Pareto distribution.d. Families F that are closed under sampling. and N ( . ) with Gamma G ( . ). : : : . Multinomial Mk ( 1 .1) f (xj ) = C ( )h(x) expfR( ) T (x)g. 1= ) with Gamma G ( . 2 ) distributions are associated with normal N ( . 2 ) conjugate priors. ). Poisson P ( ) with Gamma G ( . : : : .i.8 Notes 1. If ( ) = IE x].1. IE ( )jx1. : : : . a conjugate family for f (xj ) is given by ( j . conjugate priors provide linear estimators for a particular parameterization.8 ] NOTES 33 1. xn ] = x0 + nx : +n 1. normal N ( . f (xj ). which include many common continuous and discrete distributions (see Brown. In particular. the prior distribution is often chosen in a parametered family so as to keep the subjective input as limited as possible. Gamma G ( . An extension of (1. : : : . xn are i.8. for both parsimony and invariance motivations.

.

1. we concentrate on the generation of random variables that are uniform on the interval 0.1. F . 1]. 1] to X . 1].1 Simulating Uniform Random Variables 2. for example. in turn. The type of random variable production is formalized below in the de nition of a pseudo-random number generator.1) F ?(u) = inf fx. In this chapter. F represents a -algebra on . based on the production of uniform random variables. 1]) and therefore equate the variability of ! 2 with that of a uniform variable in 0.1 February 27. it is always possible to represent the generic probability triple ( .1 Introduction Methods of simulation are based on the production of random variables. often independent random variables. transformed by the generalized inverse. We thus provide in this chapter a particular uniform generator. 1] (see for instance Billingsley.1.1. Examples 1. We also give an introduction to approximation methods for densities and their connections with simulation. This generation is.1] provide the basic probabilistic representation of randomness. to produce random variables from both standard and nonstandard distributions.1. U 0.1]) (where B are the Borel sets on 0. because the uniform distribution U 0. 1995).2 and 1. De nition 2.3). in describing the structure of a space of random variables.1 For a function F on IR. F(x) ug : . F ? . that are distributed according to a distribution f that is not necessarily explicitly known (see. 1]). 2. and P is a probability measure) as ( 0. The random variables X are then functions from 0. along with standard generation methods.1. the generalized inverse of F.1. 1998 The methods developed in this book mostly rely on the possibility of producing (with a computer) a supposedly endless ow of iid random variables for well-known distributions. In fact.1. P) (where represents the whole space. B( 0.CHAPTER 2 Random Variable Generation and Computational Methods Version 1. is the function de ned by (2.

1. sometimes known as the Probability Integral Transform. This is because. while it somehow clari es the usual introduction of random variables as measurable functions. f(u. 1]) : Therefore. there is no reproducibility of such samples. Before presenting a reasonable uniform random number generator. in order to generate a random variable X F. Proof. (Although. x). there are methods that use a fully deterministic process to produce a random sequence in the following sense: Having generated 1 Von Neumann (1951) summarizes this problem very clearly by writing \Any one who considers arithmetical methods of reproducing random digits is. of course.2. second. x). indeed. F(x) ug and P(F ?(U) x) = P(U F(x)) = F(x) : 2 Thus.2 If U U 0. then the random variable F ?(U) has the distribution F . and whether it is. For our purposes.1] and then take the transform x = F ?(u). 1978). (Techniques based on the physical imitation of a \random draw" using the internal clock of the machine have been ruled out. we rst digress a bit to discuss what we mean by a \bad" random number generator.) More importantly. the generalized inverse satis es F(F ?(u)) u and F ?(F(x)) x .1. Dellacherie. since those distributions can be represented as a deterministic transformation of uniform random variables.1. this basic representation is usually a good way to think about things. in practice. rst.1 We then have the following lemma.1]. we often use methods other than that of Lemma 2." .36RANDOM VARIABLE GENERATION AND COMPUTATIONAL METHODS 2.2. Lemma 2. it su ces to generate U according to U 0. which gives us a representation of any random variable as a transform of a uniform random variable. As has been pointed out several times. The generation of uniform random variables is therefore a key determinant in the behavior of simulation methods for other probability distributions. formally. possible to \reproduce randomness" (see. and a strict arithmetic procedure of course is not such a method. for example.1] . 1] which imitates a sequence of iid uniform random variables U 0. in a state of sin. F ?(u) xg = f(u. 1]. Lemma 2. The logical paradox1 associated with the generation of \random numbers" is the problem of producing a deterministic sequence of values in 0. there is no guarantee on the uniform nature of numbers thus produced and. for all x in F ?( 0. there is no such thing as a random number|there are only methods of producing random numbers. For all u 2 0.) But here we really do not want to enter into the philosophical debate on the notion of \random".2 shows that a bad choice of a uniform random number generator can invalidate the resulting simulation procedure.

Un leads to acceptance of the hypothesis H : U1. . there is neither ergodicity nor convergence of the distribution of Xn to the uniform distribution. such as the Kolmogorov-Smirnov test. .) With these limitations in mind. produces a sequence (ui ) = (Di (u0 )) of values in 0. . . Ui?k ).2. This de nition is clearly restricted to testable aspects of the random variable generation. X1n). Xn) is always the same. . which are connected through the deterministic transformation ui = D(ui?1 ). . Xn).2. . This limitation should not be forgotten: the validity of a random number generator is based on a single sample X1 . (X21 . Many generators will be deemed adequate under such examination. given the initial value X0 . in the sense that the transition kernel is equal to a Dirac mass. nor identically distributed. we can now introduce the following operational de nition. it is still possible to speak of stationary distributions in these deterministic setups. q) model. Dellacherie (1978) gives a more mathematical treatment of this subject plus a historical . Vn) of uniform random variables when compared through a usual set of tests. X2n).1. Xn when n tends to +1 and not on replications (X11. . (Surprisingly. Marsaglia has also assembled a set of tests called Die Hard. Of course. In addition. applying them on arbitrary decimals of Ui . like those of Lecoutre and Tassi (1987). Thus. because the variables (Xn ) always form a trivial Markov chain.4. For all n. Xn ) and (Y1 . . by using an ARMA(p. . . : : :(Xk1 . the \pseudo-randomness" produced by these techniques is limited since two samples (X1 . since the distribution of Xn given X0 does remain a Dirac mass for every n. knowledge of Xn or of (X1 .1 ] SIMULATING UNIFORM RANDOM VARIABLES 37 (X1 .3 A uniform pseudo-random number generator is an algorithm which.1.1] : The set of tests used is generally of some consequence. one can use methods of time series to determine the degree of correlation between between Ui and (Ui?1 . the distribution of these n-tuples depends on the manner in which the initial values Xr1 (1 r k) were generated. and perhaps more importantly. Un are iid U 0. the values (u1 . for instance. De nition 2. There are classical tests of uniformity. Yn) produced by the algorithm will not be independent. the validity of the algorithm consists in the veri cation that the sequence U1 . . which avoids the di culties of the philosophical distinction between a deterministic algorithm and the reproduction of a random phenomenon. In particular. 1]. as shown in Example 2. It is also the case that the random number generation methods discussed here are not directly related to the Markov Chain methods discussed in Chapters 6 and 7. un) reproduce the behavior of an iid sample (V1 . the sample (X1 . In fact. One can also use nonparametric tests. . Xkn) where n is xed and k tends to in nity. . nor comparable in any probabilistic sense. starting from an initial value u0 and a transformation D. Thus. Xn)] imparts no discernible knowledge of the value of Xn+1 .

Example 2. where algorithms resistant to standard tests may exhibit fatal faults. reputed to be \good".1. 2(1 produces a sequence (Xn ) that tends to U 0. the tent function progressively eliminates the last decimals of Xn .5).1] (see Problem 2. (For instance.2. 1976.1. that an algorithm of Wol (1989).) Moreover. k Although the limit (or stationary) distribution associated with a dynamic system Xn+1 = D(Xn ) is sometimes de ned and known. 1994) are based on dynamic systems of the form Xn+1 = D(Xn ) which are very sensitive to the initial condition X0 .5 (Continuation of Example 2. for example. 1] 0.1. even when these functions give a good approximation of randomness in the unit square 0. results in systematic biases in the processing of Ising models (see Example 5. Classic examples from the theory of chaotic functions do not lead to acceptable pseudo-random number generators. chaotic con gurations.1. for instance. the hypothesis of randomness is rejected by many standard tests.2. In particular. 1989.4) Figure 2.1 review of successive notions of random sequences and the corresponding formal tests of randomness. This methodology is not without problems. if x 1=2. 1] (see Example 2.1. algorithms having hidden periodicities (see below) or which are not uniform for the smaller digits may be di cult to detect. for some values of 2 3:57.5).4 has a disastrous behavior. or Guegan. D(x) = 2x ? x) if x > 1=2. such as those of Martin-Loef (1966). the value = 4:00 yields a sequence (Xn ) in 0. Berge.38RANDOM VARIABLE GENERATION AND COMPUTATIONAL METHODS 2.1 illustrates the properties of the generator based on D . 1] that. as the theory of large deviations (Bucklew. has the same behavior as a sequence of random numbers (or random variables) distributed according to the arcsine distribution with p density 1= x(1 ? x). 1984. De nition 2. The histogram of transforms Yn = 0:5 + arcsin(Xn )= of a sample of successive values Xn+1 = . the second generator of Example 2. the sequence (Xn ) sometimes will converge to a xed value. the \tent" function". or particle physics. Ferrenberg. In particular. In particular.1. Gleick. The notion that a deterministic system can imitate a random phenomenon may also suggest the use of chaotic models to create random number generators. which result in complex deterministic structures (see Ruelle. the chaotic features of the system are not guarantees for an acceptable behavior (in the probabilistic sense) of the associated generator.4 {The Logistic Function{ The logistic function D (x) = x(1 ? x) produces. Landau and Wang (1992) show. In a similar manner. however. theoretically. These models. 4:00]. Pommeau and Vidal. Consider. Example 2.1). due to long term correlations in the generated sequence.1.3 is therefore functional: An algorithm that generates uniform numbers is acceptable if it is not rejected by a set of tests. particular applications that might demand a large number of iterations. Given the nite representation of real numbers in the computer. 1990).

along with the (marginal) histograms of Yt and Yt+100 .1. . 1. Moreover.0 0.0 0.8 SIMULATING UNIFORM RANDOM VARIABLES 39 0.1 shows that the sample of (Yn . while the plots of (Yn. 9899) for the sequence Xt+1 = 4xt (1 ? Xt ) and Yt = F (Xt).6 0.0 • •• • • • • • • ••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••• •••••••••• ••• •••••• •••••••••••••••••••••••••••••••••••••••• •• • •• • • •• •• • • • • • •••• • • • ••• •• • • • • •• ••• • • • •• • •• ••• •••• • •• ••• •••• •• ••••• • •••• •••• •••••••• •• •••• ••••••• • ••••• •• • ••••• •••••••••• • •••• •• • •••••• • ••• • • • • ••••• •••••••••• ••••••• •••• •••••••• ••••••••••••••••••••••••••• •••••••••• ••••••••••••••••••••••••••••••• •••••••••••••••••••••••• • • • • • • • •• • • ••••• ••••• • • ••••••• •• • • • ••••••• ••••••••••••••• ••• • • • •••• ••• ••••• •••• ••••• ••••• • • ••••••• •• ••••••• •• • • • •••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • • • •• •• •• ••••• ••• • ••••• •• • • ••• • • ••• • • • • •• • • • • • • • • •••••• •••• •••• • • ••••••• •••••••••••••••• ••• ••• •••••••••••••• ••••• ••••••• • • ••••••••••• ••••• •••• ••••••••••••••••••••••••••• ••• • • •••••• •• • • •• • •• • • • • • •• • ••• •• • •• • • • •• • • • •••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••• •••••••• • • •• • • •••• • • • • • •••••••••••••••••••••••••••• •••••••••••••• ••••••• ••• ••••••••••••••••• ••••••••••• •••••••••••••••••••••••••••••••••••••••••• •• • • • • • •• • ••• •••• • ••• •••• •• •••• • •••••••• • ••• ••• • • • •• •••••••••••••• ••••• •••• •••••••••••••••••• ••• • • • • •••••• • • •• • • •• • • • • ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••• • •• • • • •• • •• •• •• • •• •••• • • ••• •• • •• •• • • •• • • • • • ••• •• • • •• • • •••• •• ••••• ••••••••••••••••••••••••••••• ••••••••••••••••••••••••• •• ••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••• ••• •••••••••••• • •••••• •• •••• •• •• ••• •• • • • •• • •• • • ••••• • • • •• • • •• ••••• • • • ••• • • •••••••••••• •••••••••••••••••••••••••••••••••• ••• •• •••••••••••••• •••• ••••••••• •••••••• •••••••••••••• •••• • ••••••• ••• ••••• • • •• • • • • • • • • • •• •• • • • •••• •• • ••••• ••••••••••••••••••••• ••• •••••••••• •••• •• •••••••••• • • •• ••••• •• •• • •••••• ••• • ••••••••• • • • •• • • • • • • •• •••••••••••••••••••••••• •••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••• • • •• •• • • • • • •• •••• •• ••• • • • •• • • • • • ••• • •• • • • •• •• • •• • ••• •••••••••• •••••• ••••••••••••••••••••• • •••••••••• •••••••• •••••• • ••• • • • •• •••• • •••• •••••••••••• •• ••••••• ••• • •• • ••• • • • •• • •• ••••• • • • ••••• ••••• ••••••• • •• • •••••• • •• •• • • ••• •••• • ••• •• • •• ••• ••• ••••• •••• ••• • ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••• ••• ••••••••••••••••••••••••••••••••••••••••••• • •• • ••• • • • •• • •• •• • • • • • • • •••••• •• •••• •• • •• •• •• ••• •• • •• • • •• •• •••••• •••••• ••• •••• ••••• •• • •• •••• ••• •••••• •• • • •••••• • • • •••••••••••••••••• ••• •••• •• •••••••••••••••••••••••• •••••••••••••••• ••••••••••••••• •••• ••••••••••••••••••••••••••• ••• •••• ••• • •• ••• • • •• • •• • •• • •• ••• • •• • • •••• • • ••• •• • • • •••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • •• • •• • • •••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • • • • •• • • • • • ••••• •• •••• • •• ••• • • •• •• •• • ••• • •• •••• •• • •••• • ••••• • ••• • • • • •• • •• • • •• • •• • • • •• ••••••• •••• • ••••• ••• •••• • • •••••••• •••• • • ••••• •••• ••••• •••• ••••• •• •••••••••••••••••• •• • ••• • ••• ••• • ••••••• •• •••••• ••• •••••••••••••••••••••••••••••• • ••••••••••••••••••••••• • ••••••••••••••••••••••••••••••••••••••••••••• • •• •• • • •• • ••• • • • • ••• •• • • •• • • • • • • ••• • •• • • ••• • • • •• • • • • • • •• •• • • • • • • •• •• •••••• •• ••••••••• ••• ••• ••• • • ••••••• • •••••••• •• •••••••••••••••• ••••••••••••• •• • •••• ••• • ••• •••• •••• •••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •• • • ••• •• ••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••• •••••••••••••••••••••••••••••••••••••••••• • • • • • • • •• • • ••• ••• • •••• • ••• •• •• • • • • •• • • • • • • •••• ••••••• ••• •• •• • • •••• • • ••••••••••• • • • •• •• •••••••••••••••• ••••• ••• •••••••• •••••••• •••• •••••••••••••••• ••••••••••••••••• ••••••••••• •••• ••• ••••••••••• • • •• • • • • ••• •••• • • • • • • • • •••••• • ••••••••• •••••••••••••• •••••••••••• •••••••• •••• ••• •••••• •••••••••••••• •• •••• • •••••••••• ••••••• • • ••••• ••• ••••• •• • • • •• • • •• •• • • • • • •• • •• • • • • • • • • • •••• • •• • • • • • •• • •••• •••••• ••••• •• •• •• • ••• • ••••••• • ••••• • ••••• • • •• •• ••••• ••••• • ••••••• •• • ••• • •••••• ••••••• •• • ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••• • • • • • • •••••••• •••••••••••••• •••••••••••••••••••••••••• ••••••••••••• •••••••••••••••••••••••••••••••••• •••••• ••••••• •••••••••••••••• •••• • • • • • •• ••••••• • •••••••• •••• •••••••••••••••••••••• • ••• •••••••••••• • •••••• ••••••• ••••• ••••••••• ••••• ••• • ••••••• • •• •••••••• • • •• • ••• • • • • • •• • • • ••• • • •• •• • •• • • • • • ••••• •••• •••••••••• ••••••• •••••• ••••••• •• •••••••• ••• ••• • • •• • • •• • ••••• ••• • •••••••• •••• •• ••••••• ••• •• ••• • • • •••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••• • •• •••••••••••• •••••••• •••••••• •••• ••••••• ••••••••••••••••••••• ••• ••••• •••• •• •••••• ••• •••••• ••• ••• • • • •• •• • • • •• • ••••• • • • • • • • • • • • • • • •• • • ••• •• ••••••••••• •••••••••••••••••••••••••••••••••• •••• •• ••••••••••••••••• •• •••• ••••••••••• •••••• ••••••••• ••••• •••••••• • •• •• • • •• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••• ••• •••• ••• ••• • • •• ••••••• • •• •• ••• ••• •••• •• •• ••• •••••• ••••• • ••• ••• ••••• •••• • •• •• •• • • •• • • •••••••• •• •• •••••••• •• •• •••• •••• • • •••• • • •••••••••• •• • ••• ••• •• ••• • • •• •• •• • ••• • •• ••• • • • • • • • • •• • •• • •• • • • •• ••••• • • • • • • •• • • • • •• • • • • ••• • • • • •••• •• • ••• •• ••• • ••••• • •• • • ••• • • ••• • ••••• •• ••••• • •••••••••••••••••••••••••••••••••••••• ••••••••• ••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••• •••••••••••• ••••••••••••••••••••••••• ••••• •••• ••• • •• ••••••• ••••••••••••• •••••••••••••••• ••••••••••• •••••••••• •••• ••••••••• ••••••••• • • •••••• •••••••••••••••••••••••••••••••••••••••••• •••••••• •••••••••••••••••••••••••••• ••• ••••• •••••••••••••••••••••••• • ••••• • •• • ••••••••• •••••• •••• ••••• ••••••••••• •••• •• •••• •••••••••••••••••••• ••• •••• •• ••• • ••• •• •••••••••••••••• • • •• • • • •••• •• • ••••• •••••••• • •••• •• ••••• ••••• •••••••••• •• •••••• •• •• •••••• • • ••• ••• •••• ••••••••••• ••• •••• • • • • • • • ••••• •••••• • •• • • •• •• • ••• •• • • • •• • • •• •• ••• • • • ••• • •••• • • •• • ••• • •• • • • • • • • •• •••••• •• ••••• • • • • • • • • • • • • • • •• •• • • • •• • •• • • •••••••••••••••••••••• •••••• •••••••• •••• •••••••••••••••••••••• • • •••••••••••••••••••••••••••• •••••••••••••• •• ••••••• •• • •• • •• • • ••• •• •• •• • •••• •• • • •• ••••• •••• ••••••••••••• • ••••••• ••••••• •• •••• •• • ••• •• • •• •••••••••••••••••••••• •••• • •• • • • • • • ••• •• • •• • • • • •• •• •••• • •• • • • •• • •• • • ••• •• •• • • •• • • • • •• •• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••• •••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •• • ••• ••• ••••••• • • •• •• • ••• •••••••• •••• •••••• •• • • • • •• •••• • • ••••• • • •• •••• •• • ••• • • •• • • •• • • • •••••••••••••••••••• •••• ••••••••• ••••••••••••••••• ••••••••••• •••••••••••••••• ••••••••••• ••• •• •••• •••• ••• • •••• ••••••• •• • • • • • • • • •• • •• • • • • •• •• • • •• • •• • • • ••• •• • •• • •• • • •• •• •• • • •• • •••• • •• • •• • • •• •• • •••••••••••••••••••••••••••• •••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••• • • • • • • • • • • • •• ••••••••••••••••••••••• •••••••••••••••• •••••••••••••••••• ••••••••••••••••••• ••••••••••••• • ••••• •••• •• •• •••••••••••••• ••• •• • • • • • •• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••• • •• •• ••• •• • •••• •• • • • •••••••••••••••••••••• ••••• • ••••••••••••••• •• •••••• •• ••••• •••••••••••• ••••••••••••••••• •• ••••• • ••• • •••• • • •••• • •••••• • • ••••• • •• •••• • • •• •• •••••••••••••••••• • •••• •• • •••••••••••••• •• • • ••••• • • • •• ••••• •• • ••• • • ••• • ••• • ••• •• •• • • • •• ••••••••• •• •••• ••• ••• ••• •• ••••• • •••• • • • • •• •• • •••• •• •• • • •• • • • •• • ••• •• •• • ••••• •• ••• ••• •••• ••• ••• •• •• •• ••• • •••••••• • • ••• • • • • •••••• • •• • • • •• ••••••• ••••••••• •••••••• •••••••• • • • ••• ••• •• •• •••• • •• • • • ••••• ••• • ••• •• •• ••••• •••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••• •••••••••• ••••••••••••• ••••••••••••••••••••••••••••••••••••• • • • •• • • •• • • • ••• ••• ••••• •••• • • ••• • •• • •• •• • • • • • ••••••••••• ••• •••••• •••• • ••••• • •••• • •• • •••• • •• • •• • •• • • • ••• • • • ••• ••• • •••••• • ••••• • • •• •• • ••• • •••••••••••••••••••••••• ••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••• •••••••• • ••••• •••• ••••••• •• •••••• • ••••• •• • ••• •• ••• •••• ••• • •••• • •• • • •• •••• ••• • ••• • •••• ••• ••••••• ••• ••• • • •••• •••• • •••• •••••••••• • • • •• ••• • •••• •• •••••• •• • • • • • • •• • • •• ••• •• • • •• • •• •• • • • • • • •••• • ••• •• ••• • • • ••• • • • • • • •• •••• •• • • •• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••• ••••••••••••••••••••••••••••• • • •• •• •• • • •• ••••••••• ••• ••• • • ••• • •• ••••••• •• •• • • •• • • ••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • • • • •• •• • • •• • ••• • • • ••• •• • •• •• •• • •• • • • ••••••• ••• • • • • • • • • •• • •• ••• • • • •• • • • •• • ••••••••••••••• •••••••••••• •••••••••••••••••••• •••••••••••••••• • ••• •••••••••••••••••• ••••• ••• ••••••••••• •• •••••••••• • • •••• •••••••••••• •• ••• •• ••• •••• •• • ••••• • •• • •• •••• • •••• ••••••••••• • •• ••• ••• ••••• •••• • •• •• •••• • • •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • • ••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• •••• •••••• •• •• •••••••••••••• ••••• •• ••• ••• • •••• ••• • ••• • ••••• • •• • •• • • • •• • ••• •••••••• • ••• •• ••• • •• ••••••• ••• • • ••••• •••• •••••••• ••••••••••••••••••••••••••••• •••••••••••••• •••••••••• • •••••• •••• •••••••• •••••••• ••••••• •••• ••••••••••••••••• • ••••• • ••• • • • • • • • •• • • • • • • • ••••• • • • • • • • ••••••••••• •••• •••••• • ••••• ••••• •••••••••••• ••••• •••••••••• •••••••• •••••••••••••••••••••• • ••••••••• ••••••••• •••• • • • • •• • • ••• ••• • •• •• ••• • •• • •••••• ••• • • • ••••••• • • • •• ••••••••••• • •• • • • • •••••••• ••••• •• •••• ••••• ••• ••••••••••••• •••••••• •••••• • •••••• •••••• ••••••• • •••••••••• ••••• • • • • • • •• • • • • ••• • ••• ••••••••••••••••• • •••••••• •••• ••• •••• •••• • •••• • •• •• • •• • •• • ••• • • •••• • • •••• •••••• • • •• • • ••• •••••••••••••••••••••••••• ••••••••••••••••••••••••••••• •••••••••••••••• •••••• •••••••••••••••••••••• •••• •••••• ••••• •••••• • ••••••• •••• ••• ••••••••• •• •• • •••••• ••••••••• •••• •••• ••••••• • •• ••• ••• • ••• • ••••• •• • •••• ••• •• •••••••• ••• • • •• • • •• • • • • • • •• ••• • • •• • ••• • • •• • ••• • ••• • • • • •••• • •• • • • • •• • •• •• • • •• •• ••• • • 0. Figure 2. Plot of the sample (Yt .6 0. After all. 1] but rather on the integers f0.6 0. where M is the largest integer accepted by the 2 The name is an acronym of the saying Keep it simple. this is a statistics text! .8 1. Yn+1 ) and (Yn. Yt+100 ) (t = 1.4 0. Ripley (1987) and Fishman (1996) are excellent sources. Yn+10) do not display characteristics of uniformity.0 0.2 0.. M g.2. : : : .4 t 0.2 Figure 2.0 0.2 0. Preferred generators are those that take into account the speci cs of this representation and provide a uniform sequence. instead of a catalog of usual generators. It is important to note that such a sequence does not really take values in the interval 0.0 0.8 t+100 0.0 0. we only present a single generator.. For those.0 1.2 0. the books of Knuth (1981).8 1.. Rubinstein (1981).1. Yn+100) satisfactorily lls the unit square. But the 100 calls to D between two generations are prohibitive in terms of computing time.0 0.4 0. and not re ective of more romantic notions.2. stupid!.2 0.1.4 0.4 0. k We have presented in this introduction some necessary basic notions to now describe a very good pseudo-random number generator.1.1 ] 0.8 1. 2. the algorithm Kiss2 of Marsaglia and Zaman (1993). the nite representation of real numbers in a computer can radically modify the behavior of a dynamic system.0 0.6 0.2 The Kiss Generator As we have remarked above.8 1. To keep our presentation simple. D (Xn ) ts the uniform density extremely well.

56 Establish (2. 2. there exist such sequences. K ( ) = ? log( ). : : : .6. xn ) leads to the Kolmogorov-Smirnov test in nonparametric Statistics. Xn be iid from the Pareto P a( . ) distribution with known lower limit .u] (xi) ? u : u i=1 This is also the Kolmogorov{Smirnov distance between the empirical cdf and that of the uniform distribution. As shown in Niederreiter (1992). the saddlepoint is given by f (xj ) = x +1 . The corresponding density is x > .e.55 Show that the quasi-Monte Carlo methods introduced in x2. which ensure a divergence rate of order O(n?1 log(n)d?1 ). Since it can be shown that the divergence is related to the overall approximation error by (2. xn ) . 2.6 ] NOTES 75 (d) Let X1 .6 Notes 2.1 Quasi-Monte Carlo methods Quasi-Monte Carlo methods were proposed in the 1950's to overcome some drawbacks of regular Monte Carlo methods by replacing probabilistic bounds on the errors with deterministic bounds.1) n i=1 n 1 X h(x ) ? i Z f (x)dx V (h)D(x1 . n (1? = ^) 1 : ^ e ^ 2 Show that the renormalized version of this approximation is exact.1 lead to standard Riemann integration for the equidistributed sequences in dimension 1. i. : : : . xn?1 do not depend on n. Show that X S = ? log Xi . : : : . f ( ^j ) h n i1=2 2. : : : . and the saddlepoint approximation is ?n ^= t s + n log( ) . : : : . where d is the dimension of the integration space.1) and show that the divergence D(x1.6. : : : . The idea at the core of quasi-Monte Carlo methods is to substitute the randomly (or pseudo-randomly) generated sequences used in regular Monte Carlo methods with a deterministic sequence (xn ) in order to minimize the so-called divergence n 1X D(x1. such that x1 .6. xn ) for all n's. the solution is obviously xi = 2i?1 in dimension 1 but the goal here is to get 2n a low-discrepancy sequence (xn ) which provides small values of D(x1 . used in non-parametric tests. and can thus be updated sequentially.2.2. For xed n. xn ) = sup n II 0.6.

4). can be quite involved. Construction of these sequences. See Niederreiter (1992) for extensions in optimization setups. 1978). where V (h) is the total variation of h. . V (f ) = Nlim !1 the gain over standard Monte Carlo methods can be substantial since standard methods lead to order O(n?1=2 ) errors (see Chapter ??. The true comparison with regular Monte Carlo methods is however more delicate than a simple assessment of the order of convergence.6 (see Niederreiter 1992). The advantage over standard integration techniques such as Riemann sums is also important when the dimension d increases since the later are of order nd?2 (see Yakowitz et al. although independent from h.2. x3. the construction requires that the functions to be integrated have bounded support.76RANDOM VARIABLE GENERATION AND COMPUTATIONAL METHODS 2. x0 =1 ::: xN =1 j=1 sup N X jh(xj ) ? h(xj?1 )j . More importantly. which can be a hindrance in practice.

and Examples 3. may involve the integration of the empirical cdf (see x1.4.1. Moreover. Bayes.2. 1998 There are two major classes of numerical problems that arise in statistical inference.1).1 . 1.1. other statistical methods. although unrelated to the Bayesian approach.1.1 Introduction Z L( . This necessitates an approximate solution of (3. ) and the prior is the solution of the minimization program (3. as shown by Examples 1. whatever the type of statistical inference. The previous chapter has illustrated a number of methods for the generation of random variables with any given distribution. Thus. Examples 1. (An associated problem.CHAPTER 3 Monte Carlo Integration Version 1. Similarly.).1) either by numerical methods or by simulation.5).4.2 February 27.4. ) ( ) f(xj ) d : Only when the loss function is the quadratic function k ? k2 will the Bayes estimator be a posterior expectation. optimization problems and integration problems.1) in terms of ( jx) (see for instance Robert 1994a or 1996c for the case of intrinsic losses). and integration with the Bayesian approach. While some other loss functions lead to general solutions (x) of (3.4 have also shown that it is not always possible to derive explicit probabilistic models and even less possible to analytically compute the estimators associated with a given paradigm (maximum likelihood.3. In general. On the other hand.1 { 1.3. we are led to consider numerical solutions.1.) Although optimization is generally associated with the likelihood approach. Bayes estimators are not always posterior expectations. method of moments. alternatives to standard likelihood. such as bootstrap methods. and . that of implicit equations can often be reformulated as an optimization problem.1) min 3. these are not strict classi cations.1. etc. such as marginal likelihood.1. the Bayes estimate under the loss function L( . a speci c setup where the loss function is constructed by the decision-maker almost always precludes analytical integration of (3. may require the integration of the nuisance parameters (Barndor -Nielsen and Cox 1994).1.1.

3) shows that the associated Bayes estimator satis es Example 3. L( . 2 = sin('1 ) cos('2 ). ai+1). 'p?1 are the polar coordinates of .78 MONTE CARLO INTEGRATION 3. (x). ) = wi( ? )2 when ? 2 ai. the computation of (x) requires the computation of the posterior means restricted to the intervals ai . ) = j ? j. ai+1 ). '1.1. Ip ).3.3. or Huber's (1972) loss function.1. which is the solution to the equation (3. that is when = k k2 and X this equation is quite complex. ai+1 ).3) L( . before embarking on a description of simulation based integration methods.1.1 {L1 loss{ For 2 IR and L( . : : :.1. 1 where .1. ) = 2( cfj ? j ? c=2g otherwise. since ( jx) / p?1=2 (x) ( ) f(xj ) d = Z (x) ( ) f(xj ) d : Np ( . c. that is. Thus. !i > 0: Di erentiating (3.1. Example 3. the Bayes estimator associated with is the posterior median of ( jx).2) Z In the setup of Example 1. (3. +1 that is P w R ai i a (x) = P i R iai w . just as the search for stationary state in a dynamical system in physics or in economics can require one or several simulations of successive states of the system. Similarly.1 hence provides a basis for the construction of solutions to our statistical problems. k Consider a loss function which is piecewise quadratic. described in Chapter 4. We now look at a number of examples illustrating these situations. statistical inference on complex models will often require the use of simulation techniques. ) = wij ? j if ? 2 ai. and of the posterior probabilities of these intervals. : : : = cos('1 ). which often rely on Markov chain tools. Z e?kx? k =2 2 p?2 Y i=1 sin('i )p?i?1 d'1 : : :d'p?1 . +1 ( ) f(xj ) d : ( ) f(xj ) d i i ai Although formally explicit. consider a piecewise linear loss function. Chapter 5 deals with the corresponding simulation based optimization methods. ? )2 if j ? j < L( .2 {Piecewise linear and quadratic loss functions{ X i wi Z ai ai +1 ( ? (x)) ( jx) d = 0 .

estimators such as empirical Bayes estimators. ^. best unbiased estimator. the . Given a sampling distribution f(xj ) and a conjugate prior distribution ( j . without taking into account the e ect of the substitution). most priors result only in integral forms of . The estimated distribution ( j ^ .5.1). etc. The corresponding conjugate prior is Np ( . It may seem that the topic of James-Stein estimation is an exception to this observation. the empirical Bayes inference is based on the pseudo-posterior Np ( ^ x=( ^ + 1). or even to establish that a new estimator (uniformly) dominates a standard estimator. Nonetheless. ^Ip =( ^ + 1)).1 ] INTRODUCTION 79 where and c are speci ed constants.3. k Inference based on classical decision theory evaluates the performance of estimators (maximum likelihood estimator. which are quite attractive in practice. in these situations. This leads to the maximum likelihood estimator ^ = (kxk2 ? p + 1)+ . In the empirical Bayes approach. ) d by maximum likelihood. for instance. from the marginal distribution m(xj . The following example illustrates some di culties encountered in evaluating empirical Bayes estimators (see also Example 3. and if it is evaluated under a quadratic loss. the empirical Bayes method estimates the hyperparameters . ). In most cases it is impossible to obtain an analytical evaluation of the risk of a given estimator. also called risks. Example 3. ^) is then used as in a standard Bayesian approach (that is. given the abundant literature on the topic. (1992. to derive a point estimator.1. k k2 is the quantity of interest.) through the loss imposed by the decision-maker or by the setting. ) = Z f(xj ) ( j . Since the posterior distribution of given is Np ( x=( + 1).3 {Empirical Bayes estimator{ Let X have the distribution X Np ( . Chapter 5). If. it is possible to analytically establish domination results over the maximum likelihood estimator or unbiased estimators (see Robert 1994 Chapter 8.3. Estimators are then compared through their expected losses. will rarely allow for analytic expressions. Chapter 9) for a more detailed discussion on this approach. See Maritz and Lwin (1989) or Searle et al. the scale hyperparameter is replaced by the maximum likelihood estimator. ( + 1)Ip ). moment estimator. Ip =( + 1)). for instance = 0. Some of these may be quite complex. Ip ) where the hyperparameter is generally xed. for some families of distributions (such as exponential or spherically symmetric) and some types of loss functions (such as quadratic or concave). Although a speci c type of prior distribution leads to explicit formulas. In fact. based on the marginal distribution X Np (0. This makes their evaluation under a given loss problematic. Ip ). or Lehmann and Casella 1997.

3. In the setup of Decision Theory. IE 2k k1 + p x. this solution is natural.3. k A general solution to the di erent computational problems contained in the previous examples and in those of x1. since they allow for a control of the convergence of simulation methods (which is equivalent to the deterministic bounds used by numerical approaches. and then the resulting Bayes estimator k2 (x) = IE 2k kk2 + p x. ) = IE (k k2 ? )2 ] .80 MONTE CARLO INTEGRATION 3. R( . One can therefore apply probabilistic results such as the Law of Large Numbers or the Central Limit Theorem. This quadratic risk is often normalized by 1=(2k k2 + p) (which does not a ect domination results but ensures the existence of a minimax estimator. of either the true or approximate distributions to calculate the quantities of interest. This is more easily . kxk2 ? p.2 Classical Monte Carlo integration Before applying our simulation techniques to more practical problems. and the maximum likelihood estimator based on kxk2 2 (k k2 ) (see Saxena p and Alam 1982 and Example 1. or Lehmann and Casella 1997. for the three estimators. whether it is classical or Bayesian. Chap. since the proof of this second domination result is quite involved. However. Note that the possibility of producing an almost in nite number of random variables distributed according to a given distribution gives us access to the use of frequentist and asymptotic results much more easily than in usual inferential settings (see Ser ing 1987.1 is to use simulation. we rst need to develop their properties in some detail. one might rst check for domination through a simulation experiment which evaluates the risk function. ??) where the sample size is most often xed.) 3. since risks and Bayes estimators involve integrals with respect to probability distributions.2 empirical Bayes estimator is ! ^ 2 2 ^p eb(x) = ^ + 1 kxk + ^ + 1 #2 " p + kxk2 + p 1 ? p = 1 ? kxk2 kxk2 2 ? p)+ : = (kxk + This estimator dominates both the best unbiased estimator. 2 does not have an explicit form. see Robert 1994).2). We will see in Chapter 5 why this solution also applies in the case of maximum likelihood estimation.

1) IEf h(X)] = Z Based on previous developments. hm = m j j =1 since hm converges almost surely to IEf h(X)] by the Strong Law of Large Numbers. ?0:7).3.3. with = ( cos '1. Xm ) generated from the density f to approximate (3. when h2 has a nite expectation under f. Example 3.1 (Continuation of Example 3. '2) ( > 0.2.2. Moreover. '1 .2. '1 .2. sin '1 cos '2.3) Consider the evaluation of for p = 3 and x = (0:1.1) by the empirical average m 1 X h(x ) . and this leads to the construction of a convergence test and of con dence bounds on the approximation of IEf h(X)]. X h(x) f(x) dx : Z If ( ) is the non-informative prior distribution proportional to k k?2 (see Example 1.3. '1 2 0.2) ( jx) / k k?2 exp ?kx ? k2 =2 : The simulation of (3. '2jx) / exp x ( = ) ? 2 =2 sin('1 ) : IR 3 (2k k2 + 3)?1 ( jx) d : . 2 ]. sin '1 sin '2 ).1. =2]).2 ] CLASSICAL MONTE CARLO INTEGRATION 81 accomplished by looking at the generic problem of evaluating the integral (3. Since 1 (x) = 2 IE (2k k2 + 3)?1 jx]?1 ? 3 . 1:2. we would need a sample ( 1 . : : :. 1) variable. which yields ( .2. m ) from the posterior distribution (3.2). : : :. Xm ) through m 1 X h(x ) ? h ]2 : vm = m2 m j j =1 hm ?p f h(X)] IE vm is therefore approximately distributed as a N (0. : : :. the speed of convergence of hm can be assessed since the variance 1 Z (h(x) ? IE h(X)])2 f(x)dx var(hm ) = m f X can also be estimated from the sample (X1 . it is natural to propose using a sample (X1 . '2 2 ? =2.2) can be done by representing in polar coordinates ( . this requires the computation of IE (2k k2 + 3)?1 jx] = For m large.

1] until N (x . then j'1. The algorithm corresponding to this decomposition is the following 1.2. Simpson method.82 MONTE CARLO INTEGRATION 3. We mentioned in x3.1 gives a realization of a sequence of Tm . '2 jx) / expf(x )2 =2g sin('1 ). Simulate 1 2 from the uniform distribution on the half unit sphere and from 0. Unfortunately. which ensures identi ability for the model.19 Polar Simulation (' . since the performances of complex procedures can be measured as in Example ?? in any setting where the distributions involved in the model can be simulated. '2 ). m 1 X (2 2 + 3)?1 : Tm = m j j =1 Figure 3. etc. ' ) U U U expfx ? jjxjj2=2g A:19] 2. the envelope being constructed from the normal approximation through the 95% con dence interval Tm 1:96pvm : k The approach followed in the above example can be successfully utilized in many cases. and an approximation of IE (2 2 + 3)?1jX]. Since now varies in IR. One can therefore simulate ('1 . 1) The sample resulting from A:19] provides a subsample ( 1 . which only depends on ('1 . '2) is not directly available since it involves the cdf of the normal distribution. The same applies for testing (which is formally a branch of Decision Theory) where the level of . '2jx) / (?x ) expf(x )2 =2g sin('1 ). we can modify the polar coordinates in order to remove the positivity constraint on . m ) in step 2. The integration of then leads to ('1 .3.2 where x = x1 cos('1 ) + x2 sin('1) cos('2 ) + x3 sin('1 ) sin('2 ). '2) based on a uniform distribution on the half unit sphere.) in dimension 1 or 2. : : :. '2 ) is ('1 . '2 Algorithm A. Generate from the normal distribution N (x1 cos('1 ) + x2 sin('1 ) cos('2 ) + x3 sin('1 ) sin('2 ). The scope for application of this Monte Carlo integration method is obviously not only limited to the Bayesian paradigm. The alternative constraint becomes '1 2 ? =2. 1) truncated to IR+ . even though it is often possible to achieve greater e ciency through numerical methods (Riemann quadrature. If we denote = = . which can be simulated by an accept-reject algorithm using the instrumental function sin('1 ) expfjjxjj2=2g.1 the potential of this approach in evaluating estimators based on a decision-theoretic derivation. . the marginal distribution of ('1 . For simulation purposes. =2]. the marginal distribution of ('1 .

Convergence of the Bayes estimator of jj jj2 under normalized quadratic loss for the reference prior ( ) = jj jj?2 and the observation x = (0:1.2.25 CLASSICAL MONTE CARLO INTEGRATION 83 0. Example 3. and simulation thus can provide a useful improvement over asymptotic approximations when explicit computations are impossible.2. 1:2. a possible way to construct normal distribution tables is to use simulation. and its power function. Xn). (X1 .15 0. k 2 . signi cance of a test.3.1. The approximation of Zt 1 p e?y =2dy (t) = ?1 2 by the Monte Carlo method is thus n ^ (t) = 1 X I xi t. Note that greater accuracy is achieved in the tails. : : :. The envelope provides a nominal 95% con dence interval on jj jj2.2 {Normal cdf{ Since the normal cdf cannot be written in an explicit form. For values of t around t = 0 the variance is thus approximately 1=4n and to achieve a precision of four p decimals the approximation requires on average pn = 2 104 simulations. Consider thus the generation of a sample of size n. ?0:7). based on the Box-Muller algorithm A1 ] of Example 2. Table ?? gives the evolution of this approximation for several values of t and shows an accurate evaluation for 100 million iterations.3.2. that is.20 20 40 60 (100 simulations) 80 100 Figure 3.2 ] 0. n i=1 with (exact) variance (t)(1 ? (t))=n (as the variables I xi t are independent Bernoulli with success probability (t)).10 0 0.1. can be easily computed. 200 million iterations.

3. (c) Examine the alternative based on a gamma distribution G a( . Deduce that (3.6 tg Z )) b(Zi ) = 1 + (n(?(Zi ) ? f)(f (iZ ) : 1)(1 ? i Pn+t?1 b(zi). in (3. (Hint: Show that the sum of the weights Sn can be replaced by (n ? 1) in and assume IEf h(X )] = 0. ) on = jj jj2 and a uniform distribution on the angles. p) with p = 10?6 . . show that J (m) is distributed from ( 0 )?n exp(?n J ).7.28 Given a binomial experiment xn B(n. show that 3. ) is not satisfactory from both theoretical and practical points of view. under quadratic loss. which is a matrix with (i.7.) 3.130 MONTE CARLO INTEGRATION ! n+t?1 1 h(Z ) + X b(Z )h(Z ) = n n+t i i i=1 ?1 3. 10?3 . 10?2 .25) If Sn = 1 with ? 1 = n h(Zn+t ) + nS 1 n n+t?1 i=1 X b(Zi )h(Zi ) ! asymptotically dominates the usual Monte Carlo approximation. that is. k) element @ ij (x)=@xk . Extend to the general case by introducing @ j . consider the distribution ? exp ?( ? )t ?1 ( ? )=2 ( )/ : jj jjp?1 (a) Show that the distribution is well-de ned. 3. determine the minimum sample size n for P 1 3. b( ) is de ned by b(x) = b0 (x) + 2 (x) 0 (x) in di- when = 10?1 .2) is unbiased.4). 3. that Z exp ??( ? )t ?1( ? )=2 IRp jj jjp?1 d < 1: (b) Show that an importance sampling implementation based on the normal instrumental distribution Np ( .31 Show that.29 When the Yi 's are generated from (3.1).30 Show that Zt 1 ( (s) ? (0))d (s) = 2 ( (t)2 ? t) 0 xn n?p p mension = 1. Philippe and Robert 1997) For a p p positive de nite symmetric matrix.7. conditional on the number of rejected variables t. 3.27 (Berger.26 (Continuation of Problem 3.

7 Notes 3.3) are de nitely necessary to get an acceptable approximation. where the second integral is de ned as the limit n X (X (si)) + (X (si?1)) lim0 ( (si) ? (si?1)). if (X (t)) is solution to (3. we showed in Example 3. random variables.7.) If M ( ) = IE exp( X1 )] is the moment generating function of X1 .3) can also be expressed through a Stratonovich integral. (When " is small.d. Bucklew (1990) indicates how the theory of large deviations may help in devising proposal distributions in this purpose. 4 3. p.3. (Note: The integral above cannot be de ned in the usual sense because the Wiener process has almost surely unbounded variations.6. and " is large.1 Large deviations techniques When we introduced importance sampling methods in x3.1 that alternatives to direct sampling were preferable where sampling from the tails of a distribution f . (x) = 2 cos2 (x). the large deviation approximation is 1 log P (S 2 F ) ? inf I (x): This result is sometimes called Cramer's theorem. methods such as importance sampling (x3. Very sketchily. If the problem is to evaluate ! n 1 X h(x ) 0 . in the sense that it involves the unknown constant I .3). say p(A) 10?6 . Since the optimal choice given in Theorem 3. The simulation device based on this approximation is called twisted simulation. See Talay 1996. In particular. X (t) = X (0) + Zt 0 b0 (X (s))ds + Zt 0 (X (s)) d (s).i. the normal approximation based on the Central Limit Theorem works well enough.3.3) dX (t) = ?X (t)dt + 2d (t) is a stationary N (0. 3. and if I (x) = sup f x? log M ( )g. ! i=1 2 where 0 = s0 < s1 < : : : < sn = t and maxi (bi ?si?1 ) . 1) process. n goes to in nity. When the event A is particularly rare.6. I=P n n F n i=1 i .) 3. : Sn = X1 + :n: + Xn . the theory of large deviations is concerned with the approximation of tail probabilities P (jSn ? j > ") when Sn is a sum of i.34 Show that. more practical choices have been proposed in the literature.32 Show that the solution to (3.3.7 ] NOTES 131 3.7. Y (t) = atan X (t) is solution of p a SDE with b(x) = 1 sin(4x) ? sin(2x).4 is formal.3.33 Show that the solution to the Ornstein-Uhlenbeck equation p (3.3.56.

This description of simulation methods for SDE's borrows heavily from the expository paper of Talay (1996) which presents in much deeper details the use of simulation techniques in this setup.7. independent components and correlation IE i (t) j (s)] = ij min(t. Applications of SDE's abound in uid mechanics. given the Wiener process ( (t)). It may also be of interest to compute the expectation IE h(X )] under the stationary distribution of (X (t)).32). where ij = IIi=j . i=1 where 0 = s0 < s1 < : : : < sn = t and maxi (bi ? si?1 ) lim0 ! IE n X whenever (X ) is square-integrable. it is often of interest to consider the perturbation of this equation by a random noise. when this process is ergodic. In this setup. random mechanics and particle Physics. (3. o (3. e.. a setting we will encounter again in the MCMC method (see Chapters 4 and 7). where b0 is a function from IRd to IRd . or to evaluate expectations of the form IE h(X (t))].7.3.2 Simulation of stochastic di erential equations Given a di erential equation in IRd . s) . (t).4) X (t) = X (0) + Zt 0 b(X (s))ds + Zt 0 (X (s))d (s). dX (t) = b0 (X (t))dt.7.7 ] NOTES 133 3.4) is based on the discretization X (t) X (0) + b(X (0))t + (X (0))( (t) ? (0)) .7. This limit exists j (X (s))j2ds < 1: See. where b is derived from b0 and (see Problems 3.3.g.7. A rst approximation to the solution of (3. . that is such that (t) is a Gaussian vector with mean 0. The solution of (3.7.6 being the variance factor taking values in the space of d d matrices. dX (t) = b (X (t)) + (X (t)) (t). that is when Zt 0 .3) can also be represented through an It^ integral. The perturbation in (3. simulations are necessary to produce an approximation of the trajectory of (X (t)). and the second integral is the limit (X (si ))( (si) ? (si?1 )).3) is often chosen to be a Wiener process. Ikeda and Watanabe (1981) for details.31 and 3. 6 The material in this section is of a more advanced mathematical level than the remainder of the book and it will not be used in the sequel.3) 0 dt which is called a stochastic di erential equation (SDE).

Other perspectives can be found in Doob (1953). Revuz (1984) and Resnick (1994) for books entirely dedicated to Markov chains. on which this chapter is based. more generally. Roberts and Tweedie 1995 or Phillips and Smith 1996). It. Hastings (1970) notes that the use of pseudo-random generators and the representation of numbers in a computer imply that the Markov chains related with Markov chain Monte Carlo methods are.CHAPTER 4 Markov Chains Version 1. since such an approximation depends on both material and algorithmic speci cs of a given technique (see Roberts. to understand the literature on this topic. note that we do not deal in this chapter with Markovian models in continuous time (also called Markovian processes) since the very nature of simulation leads1 us to consider only discrete time stochastic processes. its style and presentation di er from those of other chapters. and we refer the reader to Meyn and Tweedie (1993).1 February 27. unfortunately. Chung (1967). Feller (1970. and Nummelin (1984). especially with regard to the plethora of de nitions and theorems and to the rarity of examples and proofs. Rosenthal and Schwartz. Indeed. . Before formally introducing the notion of a Markov chain. 1998 In this chapter we introduce fundamental notions of Markov chains. Given the purely utilitarian goal of this chapter. along with basic notions of probability theory. However. for a thorough introduction to Markov chains. (Xn )n2IN. we include a preliminary section on the essential facts about Markov chains that are necessary for the next chapters. 1 Some Markov chain Monte Carlo algorithms still employ a di usion representation to speed up convergence to the stationary distribution (see for instance x??. is necessarily a brief and therefore incomplete introduction to Markov chains. will provide enough foundation for the understanding of the following chapters. 1971). we also consider arbitrary state space Markov chains to allow for continuous support distributions and to avoid addressing the problem of approximation of these distributions with discrete support distributions. Billingsley (1995) for general treatments. nite state space Markov chains. in fact. Thus this chapter. In order to make the book accessible to those who are more interested in the implementation aspects of MCMC algorithms than in their theoretical foundations. and state the results that are needed to establish the convergence of various MCMC algorithms and .

1).2). where n is generated independently of Xn .9. that is that the average number of visits to an arbitrary set A is in nite (De nition 4. The chains encountered in MCMC settings enjoy a very strong stability property. de ned as Xn+1 = Xn + n . In a simulation setup. and are essential for the study of MCMC methods. that is such that the probability of an in nite number of returns to A is 1 (De nition 4. Xn?1. this rst section provides a brief survey of the properties of Markov chains that are contained in the chapter.1 as the existence of n 2 IN such that P(Xn 2 AjX0) > 0 for every A such that (A) > 0). which is a conditional probability density.136 MARKOV CHAINS 4.1) N n=1 n converges to the expectation IE h(X)] almost surely. When the chain is 4.11). like geometric and uniform convergence (see De nitions 4.8 and 4. and we need Harris recurrence to guarantee convergence from every starting point.4. for a study of the in uence of discretization on the convergence of Markov chains associated with Markov chain Monte Carlo algorithms).4. A typical example is provided by the random walk process.1). Markov chains are constructed from a transition kernel K (De nition 4.2.5). Thus. (Therefore.6. notwithstanding the initial value of X0 . Starting with Section 4. The stationary probability is also a limiting distribution in the sense that the limiting distribution of Xn+1 is under the total variation norm (see Proposition 4. Xn+1 .1. the theory of Markov chains is developed from rst principles. a distribution such that.6.) This latter point is quite important in the context of MCMC algorithms. we are in e ect starting the algorithm from a set of measure zero (under a continuous dominating measure). : : : (see Example 4. a most interesting consequence of this convergence property is that the average N 1 X h(X ) (4. Stronger forms of convergence are also encountered in MCMC settings. if the kernel K allows for free moves all over the state space (this freedom is called irreducibility in the theory of Markov chains and is formalized in De nition 4. or even Harris recurrent. That is. Since most algorithms are started from some arbitrary point x0. which ensures that the chain has the same limiting properties for every starting value instead of almost every starting value. For those familiar with the properties of Markov chains.1 1995. this appears as an equivalent for Markov chains of the notion of continuity for functions.6.8).2.4).5. This property also ensures that most of the chains involved in MCMC algorithms are recurrent.4. In the setup of MCMC algorithms. if Xn .3. insuring that the chain converges for almost every starting point is not enough. Xn+1 ).1 Essentials for MCMC . as Xn+1 K(Xn . namely a stationary probability distribution exists by construction (De nition 4.

Chap. Example 4. K( . the transition kernel simply is a (transition) matrix K de ned by Pxy = P(Xn = yjXn?1 = x) . De nition 4. x)dx. which is independent of Xm .3.4. M g and (Xn ) such that Xn represents the state. since. of a tank which contains exactly M particles and is connected to another identical tank. and in fact is mathematically somewhat cleaner. that is when the transition kernel is symmetric. Px(x+1) = MM x Pxx = 2 M 2 x(x?1) and P01 = PM (M ?1) = 1. and the moves are restricted to a single exchange of particles between both tanks at each instant. It therefore seems natural. see Feller. In Chapter 8. to de ne the chain in terms of its transition kernel. a Central Limit theorem also holds for this average. K(x. at time n. the function that determines these transitions. y 2 X : In the continuous case. (ii) 8A 2 B(X ). The set C is then called a small set (De nition 4. diagnoses will be based on a minorization condition. Two types of particles are introduced in the system. ) is a probability measure. If Xn denotes the number of particles of the rst kind in the rst tank at time n. y < M) Pxy = 0 if jx ? yj > 1 . 1970.1).7) and visits of the chain to this set can be exploited to create independent batches in the sum (4. P = M . with probability of a transition depending on the particular set that the chain is in. x. P(X 2 Ajx0) = A K(x0 . with probability m the next value of the Markov chain is generated from the minorizing measure m .2 { Bernoulli-Laplace Model{ Consider X = f0.1. the transition matrix is given by (for 0 < x.2 Basic notions . 1. m > 0. ). XV) k 4. A Markov chain is a sequence of random variables that can be thought of as evolving over time. (This model is the Bernoulli{Laplace model. A) is measurable. ? 2 x 2 x(M ? x) . That is. that is the existence of a set C such that there also exists m 2 IN.2.2 ] BASIC NOTIONS 137 reversible.2. . the kernel also denotes the conditional density R K(x. and there are M of each type. x0) of the transition K(x. When X is discrete. and a probability measure m such that P(Xm 2 AjX0) m m (A) when X0 2 C.1 A transition kernel is a function K de ned on X B(X ) such that (i) 8x 2 X .4.

.3) is often implemented in a time-heterogeneous form and studied in time-homogeneous form. K ) on and the new value of the chain is given by Xn+1 = Y with probability expf(E(Y ) ? E(Xn ))=T g ^ 1.3) K(xk . the distribution of X0 . of random variables is a Markov chain. dx) 0 0 The chain is time-homogeneous if the distribution of (Xt .2. Similarly. Xn n = K n . the initial state of the chain. .3 Given a transition kernel K. if denotes the initial distribution of the chain. So in the case of a Markov chain. In the discrete case. : : :. K g.4 {Simulated Annealing{ The simulated annealing algorithm (see x5. given an initial distribution = (!1 . (See also the case of the ARMS algorithm in x6. = f1. for any t. the construction of the Markov chain (Xn ) is entirely determined by its transition. namely if (4. x2. x0 is the same as the distribution of Xt given xt?1.2. the simulated annealing Markov chain X0 . . Xn. in particular for equal to the Dirac mass x . !2.2) X0 .2. xk ) = P(Xk+1 2 Ajxk ) Z = (4. De nition 4. X1. Xtk ) given xt is the same as the distribution of (Xt ?t . . plays an important role. X1. x1. an energy function E and a temperature T.138 MARKOV CHAINS 4. That is. by repeated multiplication.3. 2. the conditional distribution of Xt given xt?1.2 The chain (Xn ) is usually de ned for n 2 IN rather than for n 2 Z Z. Xn otherwise.2. .2. When X0 is xed.3) Example 4. The study of Markov chains is almost always restricted to the timehomogeneous case and we omit this designation in the following. namely by the distribution of Xn conditionally on xn?1. Xt ?t . : : :). P(Xk+1 2 Ajx0. we use the alternative notation Px . denoted by (Xn ). xt?2. It is.2). Xtk ?t ) given x0 for every k and every (k + 1)-uplet t0 t1 tk . Therefore. Given a nite state space with size K. if the initial distribution or the initial state is known. important to note here that an incorrect implementation of Markov Chain Monte Carlo algorithms can easily produce time-heterogeneous Markov chains for which the standard convergence properties do not apply. Y is generated from a xed probability distribution ( 1 .1) 1= K and. the marginal probability distribution of X1 is then (4. . in the continuous case. a sequence X0.2. ::: is represented by the following transition operator: Conditionally on Xn . 1 0 1 0 2 0 0 A .4. then we let P denote the probability distribution of (Xn ) under condition (4.2. if. however.

dyn?1) : In particular. A 2 B(X ). the kernel for n transitions is given by (n > 1) Z n (x. dy): X The following result provides convolution formulas of the type K m+n = K m ? K n .4. Nonetheless. The Markovian properties of an AR(q) process can be derived by considering the vector (Xn . A) = K K n?1(y. A2 ) K(x.) If we denote K 1(x. A) K m (x. Lemma 4. we will see that the objects of interest are often these conditional distributions. ARMA(p.2 ] BASIC NOTIONS 139 If the temperature T depends on n. dy) : . n) 2 IN2.6 Chapman-Kolmogorov equations For every (m. k Example 4. and if the "n 's are independent. 2 IR. and it is important that we need not worry about di erent versions. q) models do not t in the Markovian framework (see Problem 4. x 2 X . the fact that the kernel K determines the properties of the chain (Xn ) can be deduced from the relations Px(X1 2 A1 ) = K(x. . : : : conditionally on Xn?1. An) K(x. On the contrary. X n ) 2 A1 ZA 1 Z K m+n (x. you must pass through some X K n (y. An ) = A1 An?1 . A). A) = K(x. A) K(x. Z K(y1 . However.2. which are called Chapman{Kolmogorov equations. A1) indicates that K(xn. A1 ) . dependent from Xn?2 . dy1 ) K(yn?2 . If Xn = Xn?1 + "n . dxn+1) is a version of the conditional distribution of Xn+1 given Xn . Xn?3. Px ((X1 . 2). This is why we noted that constructing the Markov chain through the transition kernel was mathematically \cleaner". k In the general case.2). Xn?q+1 ). A) = Z (In a very informal sense. X2 ) 2 A1 A2 ) = K(yn?1 . as we have de ned a Markov chain by rst specifying this kernel. the relation Px (X1 2 A1 ) = K(x. (Moreover.5 {AR(1) Models{ AR(1) models provide a simple illustration of Markov chains on continuous state space.4. in the following chapters. the chain is time-heterogeneous. Xn is indeed inwith "n N(0. the properties of a Markov chain considered in this chapter are independent of the version of the conditional probability chosen. we do not need to be concerned with di erent versions of the conditional probabilities. dy1) Px ((X1 .2. the Chapman-Kolmogorov equations are stating that to get from x to A in m + n steps.

x2. a function (x1 . namely K n = K K n?1. )jx0. Z Kh(x) = h(y) K(x.140 MARKOV CHAINS 4.2. (4. This will be used later to establish many properties of the original chain.2.2. In the general case. the (weak) Markov property can be written as the following result.6) A= 1 X t=1 I A (Xt ). A). being the dominating measure of the model. xn] = IExn h(X1 .2. Xn). and we will see that the resulting Markov chain (Xn ) enjoys much stronger regularity. the number of passages of (Xn ) in A.2. by convention.2.7 A resolvant associated with the kernel P is a kernel of the form K" (x. and follows directly from (4.1).8 Weak Markov property For every initial distribution and every (n + 1) sample (X0 . . Xn 2 Ag.2. De nition 4.6 is simply interpreted as a matrix product. (4. Xn+2 . 0 < < 1. . A = +1 if xn 62 A for every n. i. K n is then the n-th composition of P. Associated with the set A. More generally. and is called the stopping time at A with. in the convergence control of Markov Chain Monte Carlo algorithms in Chapter 8. provided that the expectations exist. )]. Note that if h is the indicator function then this de nition is exactly the same as 4. Given an initial distribution . which just rephrases the limited memory properties of a Markov chain: Proposition 4.4) can be generalized to other classes of functions|hence the terminology \weak"| and it becomes particularly useful with the notion of stopping time. Xn). The rst n for which the chain enters the set A is denoted by (4. If IE ] denotes the expectation associated with the distribution P . . : : :. X2 .e. De nition 4.2. we can associate with the kernel K" a chain fXng which formally corresponds to a sub-chain of the original chain (Xn ).9 Consider A 2 B(X ). Lemma 4.4. A) = (1 ? ") 1 X i=0 "i K i (x. However. and the chain with kernel K" is said to be a K" -chain. we need to consider K as an operator on the space of integrable functions.2 y on the nth step.5) A = inf fn 1. where the indices in the subchain are generated from a geometric distribution with parameter 1 ? ". Thus K is indeed a kernel. dy) .2. : : :) is called a stopping rule if the set f = ng is measurable for the -algebra induced by (X0 .3.2.) In the discrete case. we also de ne (4.4) IE h(Xn+1. h 2 L1( ) .

12 (see Note 4. the random walk is recurrent.y)V (y)dy (b) Establish Theorem 4.3.3). (Xn ) is recurrent. 4g and A3 = f5g.1 Suppose that the stationary Markov chain (Xn ) is geometriR cally ergodic with M = jM (x)jf (x)dx < 1. for an aperiodic irreducible Markov chain with nite state space and with transition matrix IP. 4. and establishing that V (x ) M 1 ? Px ( C < 1)].3). pnX = tends in law to N (0. such that V is bounded on C and satis es (4.nite invariant measure.9.58 Show thatxthe random walk on ZZ is transient when IE Wn ] 6= 0.9. 4. (Hint: Use V (x) = log(1 + x) for x > R and V (x) = 0.8.9 is lumpable for A1 = f1.9. the corresponding chain is Harris positive.7. choosing M such that M V (x )= 1 ? Px ( C < 1)]. there always exists a stationary probability distribution which satis es = IP: (e) Show that. 4. with IE n ] = . (Hint: Use Theorem 4. for an adequate bound R.12. otherwise. (Hint: Consider an alternative V to V and show by recurrence that V (x) Z : : : V (x) : C K (x. 2 ).) (g) Show that.9.11) are either both recurrent or both transient. A2 = f3. if an irreducible Markov chain has a .9.60 (Chan and Geyer 1994) Prove that the following Central Limit Theorem can be considered a corollary to Theorem 4.57 Show that.) (d) (Kemeny and Snell 1960) Show that. and satis es the moment conditions of Theorem 4. this measure is unique up to a multiplicative factor. (a) Establish Lemma 4. 4. if > 0.174 MARKOV CHAINS 4.9.4.9. if < 0.9. 2g. (c) Show that. (Hint: Use V (x) = 1 ? % for x > 0 and 0 otherwise when IE Wn ] > 0.6. Then 2 = limn!1 n varXn < 1 and.59 Show that the chains de ned by the kernels (4.) (f) Show that. if = 0 and var( n) < 1. if 2 > 0.9) and (4. the random walk is transient. n (Hint: Integrate (with respect to f ) both sides of De nition 4.56 Consider the random walk on IR+ .) . if there exist a nite potential function V and a small set C Corollary 4. and apply Theorem 4.y)V (y)dy + Z Cc K (x.8 to conclude that the chain is exponentially fast -mixing.3 by assuming that there exists x such that Px ( C < 1) < 1.9. (Hint: Use the drift function V (x) = x.1. Xn+1 = (Xn + n)+ .) 4.12.

On the other hand.9. e. the smallest positive function which satis es the conditions (4. Lemma 4.9.7 or Mengersen and Tweedie 1996).9. 1 if x 2 C ~ Since V (x) < 1 for x 2 C c . The condition is thus incompatible with the stability associated with recurrence.4. Meyn and Tweedie (1993) rely on another tool to check or establish various stability results. .9. The converse can be deduced from a (partial) converse to Proposition 4. while C = 0 on C. if there exists a potential function V \attracted" to 0. we have (4. and this implies the transience of C .1) are satis ed by n c ~ V (x) = (M ? V (x))=(M ? r) if x 2 C . and therefore does not allow a sure return to 0 of V . the chain is recurrent.9. V (x) 1 if x 2 C = is given by V (x) = Px ( C < 1) .3 Consider (Xn ) a -irreducible Markov chain.9 Notes 4. the drift of V is de ned by Z V (x) = V (y) P (x.4. 2 Condition (4.3. namely the drift criteria which can be traced back to Lyapunov..9. the conditions (4. V (x) ng.4. Given a function V on X .g.9 ] NOTES 175 4.9. 8n is a small set. the chain is recurrent if V (x) 0 on C c : .2) describes an average increase of V (xn ) once a certain level has been attained. Theorem 4.1 If C 2 B(X ). If C = fx.190). when C denotes C = inf fn 0.1 Drift conditions Besides atoms and small sets. Theorem 6.2 The -irreducible chain (Xn ) is transient if. V (x) rg and M is a bound on V . xn 2 C g : Note that. We then have the following = necessary and su cient condition: Theorem 4. therefore the transience of (Xn ). and only if. there exist a bounded positive function V and a real number r 0 such that every x for which V (x) > r. p. If there exist a small set C and a function V such that CV (n) = fx. V (x) = Px ( C < 1) < 1 on C c .2) V (x) > 0 : Proof. C = C . The following lemma is instrumental in deriving drift conditions for the transience or the recurrence of a chain (Xn ). if x 2 C .1) V (x) 0 if x 2 C.7 (see Meyn and Tweedie 1993.9.dy) ? V (x) : This notion is also used in the following chapters to verify the convergence properties of some MCMC algorithms (see.

for every C such that (C ) < +1.8).10) for every set C can be overwhelming. between the admissibility of an estimator and the recurrence of an associated Markov chain. then 2 2 g = nlim nIE Sn (g )] !1 = IE g2 (x0 )] + 2 1 X k=1 IE g(x0 )g(xk )] Sn (g).9. independently of the estimated functions g: Theorem 4. For every measurable set C such that (C ) < +1.178 MARKOV CHAINS 4. it is possible to assess the convergence of the ergodic averages Sn (g) to the quantity of interest IE g].4. The problem considered by Eaton (1992) is to determine whether.8 also suggests how to implement this control through renewal theory. h2V (C ) then the Bayes estimator IE g( )jx] is admissible under quadratic loss for every bounded function g.9. Assuming that the posterior distribution ( jx) is well de ned. Note also that (4. as shown in Chapter 7. h( ) 0 and h( ) 1 when 2 C and ZZ (h) = fh( ) ? h( )g2 K ( . ) = Z which is associated with a Markov chain ( (n) ) generated as follows: the transition from (n) to (n+1) is done by generating rst x f (xj (n) ) and then (n+1) ( jx). consider V (C ) = h 2 L2 ( ). as discussed in detail in Chapter 8. a generalized Bayes estimator associated with a prior measure is admissible under quadratic loss. this is also a kernel used by Markov Chain Monte Carlo methods.e.9 If. If is nonnegative and nite.. This result is obviously quite general but only mildly helpful in the sense that the practical veri cation of (4.8 If the ergodic chain (Xn) with invariant distribution satis es conditions (4. nSn (g) almost surely goes to 0.) Note that the prior measure is an invariant measure for the chain ( (n) ). This theorem is de nitely relevant for convergence control of Markov Chain 2 Monte Carlo algorithms since.9.9 Theorem 4. (4.10) always holds when is a proper prior distribution since h 1 belongs to L2 ( ) and (1) = 0 in this case. Theorem 4. If g > 0.9. for a bounded function g( ). 4. he introduces the transition kernel (4. (Most interestingly.9. i.9.9. for every function g such that jgj f . ) ( ) d d : The following result then characterizes admissibility for all bounded functions in terms of and V (C ).9. The extension then X ( jx)f (xj ) dx.9) K( . when g > 0.2 Eaton's Admissibility Condition Eaton (1992) exhibits interesting connections. similar to Brown (1971).10) inf (h) = 0. . the Central Limit Theorem holds for p g = 0.9.

Consider the property of -mixing (Billingsley 1995. the covariances go to zero (Billingley 1995. De nition 4. This is in fact the case. As a result.4. we would expect that Theorem 4. Eaton (1992) exhibits a connection with the Markov chain ( (n) ) which gives a condition equivalent to Theorem 4. A. Then 2 = limn!1 n varXn < 1 and. First. Note. So we see that an -mixing sequence will tend to \look independent" if the variables are far enough apart.59) and derive admissibility results for various distributions of interest. 2 ). that the veri cation of the recurrence of the Markov chain ( (n) ) is much easier to operate than the determination of the lower bound of (h).4. X1 . X0 2 B ) ? P (Xn 2 A)P (X0 2 B )j goes to 0 when n goes to in nity. the associated Markov chain ( (n) ) is recurrent. these conditions are usually quite di cult to verify. Section 27). Section VII. However. Section 27). which allowed us to use a typical independence argument.9. That is. we refer to Eaton (1992) for extensions. One version of a Markov chain Central Limit Theorem is the following (Billingsley 1995. These mixing conditions guarantee that the dependence in the Markov chain decreases fast enough. examples. as every positive recurrent aperiodic Markov chain is -mixing (Rosenblatt 1971. Theorem 4.11) K 0(x.9.3 Mixing Conditions and Central Limit Theorems In x4. we need the dependence to go away fast enough.9. Therefore. Hobert and Robert (1997) consider the potential use of the dual chain based on the kernel Z inf (h) = h2V (C ) Z C 1 ? P ( C < +1j (0) = ) ( )d : (4. Not only must the Markov chain be -mixing.10 For every set C such that (C ) < +1. and variables that are far enough apart are close to being independent.3 is a consequence of -mixing. Section 27): Theorem 4. is -mixing if (4.B . can also result in a Central Limit Theorem. Other conditions. and only if.2 we established a central limit theorem using regeneration. if 2 > 0. for a Central Limit Theorem we need even more. y) = f (yj ) ( jx)d (see Problem 4. 4.12 Suppose that the Markov chain (Xn ) is stationary and 12 mixing with n = O(n?5 ) and that IE Xn ]p= 0 and IE Xn ] < 1. and if the Markov chain is stationary and -mixing.9 ] NOTES 179 considers approximations of 1 by functions in V (C ).7.9. Again. Unfortunately.9. and comments on this result.9. nXn tends in law to N (0.3).9.7. X2 . for a given set C . we need the coe cient n to go to 0 fast enough. known as mixing conditions. however.12) n = sup jP (Xn 2 A. the generalized Bayes estimators of bounded functions of are admissible if.9. a stopping rule C is de ned as the rst integer n > 0 such that ( (n) ) belongs to C (and +1 otherwise).11 A sequence X0 .

.CHAPTER 5 Monte Carlo Optimization Similar to the problem of integration. as there exist simulation approaches where the probabilistic interpretation of h is not used. but simulation has the advantage of bypassing the preliminary steps of devising an algorithm and studying whether some regularity conditions on h hold . which enjoy a longer history than simulation methods (see. complex loss functions but also con dence region also require optimization procedures.1. As noted in the introduction to Chapter 3. N) xi = cos(!ti ) + sin(!ti ) + i . For the simulation approach. for instance Kennedy and Gentle 1980. di erences between the numerical approach and the simulation approach to the problem (5. 1 Although we use as the running parameter. 2). there may exist an alternative numerical approach which provides an exact solution to (5.) In approaching a minimization problem using deterministic numerical methods. smoothness) are often paramount.1. and h typically corresponds to a possibly penalized transform of the likelihood function.1) also covers minimization problems by considering ?h. By comparison with numerical methods. the appeal of simulation can be found in the lack of constraints on both the regularity of the domain and on the function h. Thisted 1988).1. Obviously. Obviously.1. the analytical properties of the target function (convexity.1) max h( ) 2 5. Sakarovitch 1984.1).1 Introduction lie in the treatment of the function1 h. Nonetheless. : : :. boundedness. of which a simple model is (i = 1. i N (0. Example 5. (Note that (5. this dichotomy is somewhat arti cial. a property rarely achieved by a stochastic algorithm. this setup applies to many other inferential problems than just likelihood or posterior maximization. the use of the analytical properties of h plays a lesser role in the simulation approach. Ciarlet and Thomas 1982.1 {Signal processing{ O Ruanaidh and Fitzgerald (1996) study signal processing data. we are more concerned with h from a probabilistic (rather than analytical) point of view. This is particularly true when the function h is very costly to compute.

. 5. which can often be achieved by a reparameterization. in which the goal is to optimize the function h by describing its entire range. The likelihood function is then of the form ? ? ! (x ? G )t (x ? G ) ?N exp ? .182 MONTE CARLO OPTIMIZATION 5.1 A basic solution . although explicit in !. such as the EM algorithm. with the Monte Carlo aspect more closely tied to the exploration of . !.5) algorithms take advantage of the Monte Carlo approximation to enhance their particular optimization technique. . : : :.1) is to simulate from a uniform distribution on . the Monte Carlo aspect exploits the probabilistic properties of the function h to come up with an acceptable approximation. um U . tN . The rst is an exploratory approach.3. k Following Geyer (1996). as shown by O Ruanaidh and Fitzgerald (1996). 1 . we want to consider two approaches to Monte Carlo optimization. even though we are considering these two di erent approaches separately. is not particularly simple to compute. which. !. if is bounded.2) (!jx) / xt x ? xtG(GtG)?1 Gtx (2?N )=2 (det Gt G)?1=2.2 Stochastic Exploration There are a number of cases where the exploration method is particularly well-suited. h(um )). and to use the approximation hm = max(h(u1 ). This setup is also illustrative of functions with many modes.5. : : :. a rst approach to the resolution of (5. 2 2 with x = (x1. xN ) and 0 cos(!t ) sin(!t ) 1 1 B .2.3) or the Robbins-Monro (5.3. We will see that this approach can be tied to missing data methods. A cos(!tN ) sin(!tN ) ?1 then leads to the marginal distribution The prior ( . . and is less concerned with exploring . In fact. they might be combined in a given problem. u1. (Such a technique can be useful in describing functions with multiple modes.1. Obviously.. The actual properties of the function play a lesser role here. : : :. This method is convergent (as 5.) The second approach is based on a probabilistic approximation of the objective function h and is rather a preliminary step to the optimization per se. . C : G=@ .1. Here. ) = ? (5. and observation times t1.2 with unknown parameters . for example. : : :. First. even though the slope of h can be used to speed up the exploration. We note also that Geyer (1996) only considers the second approach to be \Monte Carlo optimization". methods like the EM (5.

Since this function has many local minima.1. : : :. if these conditions are not satis ed. ). yi)'s. for X f(xj ). the solution of (5. as shown by Figure 5. An alternative is to simulate from the density proportional to h1 (x. a second. Distributions other than the uniform which are related with h may then do better.1.1) is the mode of the marginal distribution of .1).2. direction consists in relating h to a probability distribution.1) amounts to nding the modes of the density h. )].1. the function h( ) can be transformed into a positive and integrable function H( ) in such a way that the solutions to (5. The appeal of simulation is even clearer in the case when h( ) = IE H(x.1 that it includes the case of missing data models. : : :. it may be more useful to decompose h( ) into h( ) = h1( )h2 ( ) and to simulate from h1 . k Exploration may be particularly di cult when the space is not convex (or perhaps not even connected). More generally. via Markov Chain Monte Carlo techniques. For instance.1. In some cases. the distribution on IR2 with density proportional to exp(?h(x.1) are those which maximize H( ) on . and the simulation of the sample ( 1 . which eliminates the computation of both cosh and sinh in the simulation step. m ) can be much faster than a numerical method applied to (5. m ) from h (or H) and to apply a standard mode estimation method (or to simply compare the h( i )'s).) . see xsec:2. In particular. Example 5. and a convergent approximation of the minimum of h(x. and more fruitful. even though this is not a standard distribution. the number of evaluations should be kept to a minimum. it becomes natural to then generate a sample ( 1 .5. Therefore. y) = (x sin(20y) + y sin(20x))2 cosh(sin(10x)x) + (x cos(10y) ? y sin(10x))2 cosh(cos(20y)y) to be minimized. y)) can be simulated. y) can be derived from the minimum of the resulting h(xi . we can take H( ) = exp(h( )=T) or H( ) = expfh( )=T g=(1 + expfh( )=T g) and choose T to accelerate convergence or to avoid local maxima (as in simulated annealing. it does not satisfy the conditions under which standard minimization methods are guaranteed to provide the local minimum. If it is possible to simulate from the density H(x. in setups where the likelihood function is extremely costly to compute.2.3.5.1 {Minimization{ Consider the function in IR2 h(x. if h is positive and if Z h( ) d < +1 .2 ] STOCHASTIC EXPLORATION 183 m goes to 1) but it may be very slow since it does not take into account any speci c features of h. On the other hand. the resolution of (5. For example.3).1. (This setting may sound very contrived or even arti cial but we will see in x5.3. y) = expf?(x sin(20y) + y sin(20x))2 ? (x cos(10y) ? y sin(10x))2 g. When the problem is expressed in statistical terms.

2. For various choices of the sequence ( j ) (see Ciarlet and Thomas 1982). when the domain IRd and the function (?h) are convex. y) = h(x + y) ? h(x ? y) which approximates 2jjyjjrh(x).5 0 Y 0 X -0. with j uniformly distributed on the unit sphere jj jj = 1 and h(x. One of these stochastic modi cations is to choose a second sequence ( j ) to de ne the chain ( j ) by j h( .6.2.2. y) of Example 5. In more general setups. Grid representation of the function h(x.1. as described in detail in Rubinstein (1981) or Du o (1996.2. The sequence ( j ) is constructed in a recursive manner through (5.5. where rh is the gradient of h.5 1 0.1. 1]2 . that is when the function or the space is less regular.2) j +1 = j + 2 j j j) j . As mentioned in x1.1).1) j +1 = j + j rh( j ) . the algorithm converges to the (unique) maximum.1 on ?1.1) which produces a sequence ( j ) converging to the exact solution of (5. the gradient method is a deterministic approach. .5 -1 -1 Figure 5. (5.2. 61{63).5 -0.184 MONTE CARLO OPTIMIZATION 5. in numerical analysis.1. Contrary to the dej 5.2 0 1 1 2 Z 3 4 5 6 0.1) can be modi ed by stochastic perturbations to achieve again convergence properties.2 Gradient Methods . to the problem (5. (5.2. j >0. pp. We now look at several methods to nd maxima that can be classi ed as exploratory methods.

2. y) with different sequences of j 's and j 's. both in location and values.2.2. while Case 2 converges to the closest local minima. Note that Case 1 ends up with a very poor evaluation of the minimum. However. The nal convergence along a valley of h after some initial big jumps is also noteworthy. The solutions are quite distinct for the three di erent sequences. :8).2 (Continuation of Example 5.2.5.1.2. for some results in this direction).2 illustrate the convergence of the algorithm to di erent local minima of the function h. Figure 5. Results of three stochastic gradient runs for the minimisation of the function h in Example 5. 0:786) 0:00013 0:00013 93 1=10 log(1 + j ) 1=j (0:0004.1) We can apply the iterative construction (5. Note at this stage that ( j ) can be seen as a non-homogeneous Markov chain which almost surely converges to a given value.2. with occurrences where the sequence h( j ) increases and avoids other local minima. j ) and starting point (:65. j ). k Table 5.1 with di erent values of ( j . As shown by Table 5. due to a fast decrease of ( j ) associated with big jumps in the rst iterations.2 ] STOCHASTIC EXPLORATION 185 terministic approach. su ciently strong conditions such as the decrease of j towards 0 and of j = j to a non-zero constant are enough to guarantee the convergence of the sequence ( j ).2.2. 0:245) 4:24e ? 06 2:163e ? 07 58 This approach is still (too) close to numerical methods in that it requires a precise knowledge on the function h.5. 1:02) 1:287 0:115 50 1=100j 1=10j (0:629. The iteration T is obtained by the stopping rule jj T ? T ?1 jj < 10?5 . the study of these chains is particularly arduous given their ever-changing transition kernel (see Winkler 1996. h( T ) mint h( t ) iteration j j T 1=10j 1=10j (?0:166. namely that slower decrease rates of the sequence ( j ) tend to achieve better minima.2 and Table 5. the number of iterations needed to achieve stability of T also varies with the choice of ( j . Example 5. The convergence of ( j ) to the solution again depends on the choice of ( j ) and ( j ). this method does not necessarily proceed along the steepest slope in j but this property is a plus in the sense that it may avoid traps in local maxima or in saddlepoints of h. Case 3 illustrates a general feature of the stochastic gradient method. .2. which is not necessarily available.2) to the multi-modal function h(x.

186

1.1

MONTE CARLO OPTIMIZATION

5.5.2

0.8

0.9

1.0

-0.2

0.0

0.2

0.4

0.6

(1) j = j = 1=10j

0.805 0.785 0.790 0.795 0.800

0.630

0.635

0.640

0.645

0.650

(2) j = j = 1=100j

0.8 0.2

-0.2

0.4

0.6

0.0

0.2

0.4

0.6

(3) j = 1=10 log(1 + j ); j = 1=j Figure 5.2.2. Stochastic gradient paths for three di erent choices of the sequences ( j ) and ( j ) and starting point (:65; :8) for the same sequence ( j ) in (5.2.2). The grey levels are such that darker shades mean higher elevations. The function h to minimize is de ned in Example5.2.1.

5.5.2 ]

STOCHASTIC EXPLORATION

187

The simulated annealing algorithm2 has been introduced by Metropolis (1953) et al. to minimize a criterion function on a nite set with very large size3 , but it also applies to optimization on a continuous set and to simulation (see Ackley et al. 1985 and Neal 1994). The fundamental idea at the core of simulated annealing methods is that a change of scale, called temperature , allows for faster moves on the surface of the function h to maximize, whose negative is called energy. Therefore, rescaling partially avoids the trapping attraction of local maxima. Given T T a temperature parameter T > 0, a sample 1 ; 2 ; ::: is generated from the distribution ( ) / exp(h( )=T) and can be used as in x5.2.1 to come up with an approximate maximum of h. As T decreases towards 0, the values simulated from this distribution become concentrated in a narrower and narrower neighborhood of the local maxima of h (see Theorem 5.2.7, Problem 5.9, and Winkler 1996). The fact that this approach has a moderating e ect on the attraction of the local maxima of h becomes more apparent when we consider the simulation method proposed by Metropolis et al. (1953). Starting from 0 , is generated from a uniform distribution on a neighborhood V ( 0 ) of 0 or, more generally, from a distribution with density g(j ? 0 j), and the new value of is generated as follows: with probability = exp( h=T) ^ 1, 1= 0 with probability 1 ? , where h = h( ) ? h( 0 ). (This method is in fact the Metropolis algorithm, which simulates the density proportional to expfh( )=T g, described and justi ed in Chapter 6.) Therefore, if h( ) h( 0 ), is accepted with probability 1. On the other hand, if h( ) < h( 0 ), may still be accepted with probability 6= 0 and 0 is then changed into . This property allows the algorithm to escape the attraction of 0 if 0 is a local maximum of h, with a probability which depends on the choice of the scale T, compared with the range of the density g. In its most usual implementation, the simulated annealing algorithm modi es the temperature T at each iteration; it is then of the form

5.2.3 Simulated Annealing

Algorithm A.20 {Simulated Annealing{

2 This name is borrowed from the metallurgy vocabulary: A metal manufactured by a slow decrease of the temperature (annealing) is stronger than a metal manufactured by a fast decrease of the temperature. The vocabulary also relates to Physics, since the function to minimize is called energy and the variance factor T which controls convergence temperature. We will try to keep these idiosyncrasies to a minimal level, but they are quite common in the literature. 3 This paper is also the originator of the Markov Chain Monte Carlo methods developed in the following chapters. The potential of these two simultaneous innovations has been discovered much latter by statisticians (Hastings 1970; Geman and Geman 1984) than by of physicists (see also Kirkpatrick et al. 1983).

5.5.5 ]

PROBLEMS

211

where

**ti ( ; 2 ) = IE Zi jXi; ; 2 ] and vi ( ; 2 ) = IE Zi2 jXi ; ; 2 ]
**

IE Zi jXi; ; 2 ] = IE Zi2 jXi; ; 2 ] = + Hi u ?

(d) Show that

2 + 2 + (u + )Hi u ?

5.18 The EM algorithm can also be implemented in a Bayesian hierarchical

'(t) 1?'((tt)) if Xi = 1 ? (t) if Xi = 0: 2 (e) Show that ^(j) converges to ^ and that ^(j) converges to ^ 2 , the ML esti2. mates of and

where

Hi(t) =

(

model to nd a posterior mode. Suppose that we have the hierarchical model X j f (xj ) ; j ( j ); ( ); where interest would be in estimating quantities from ( jx). Since ( jx) =

Z

( ; jx)d ;

where ( ; jx) = ( j ; x) ( jx), the EM algorithm is a candidate method for nding the mode of ( jx), where would be used as the augmented data. (a) De ne k( j ; x) = ( ; jx)= ( jx), and show that log ( jx) =

Z

log ( ; jx)k( j ; x)d ?

Z

log k( j ; x)k( j ; x)d :

(b) If the sequence ( ^(j) ) satis es max

Z

log ( ; jx)k( j (j) ; x)d =

Z

log ( (j+1) ; jx)k( j (j) ; x)d ;

show that log ( (j+1) jx) log ( (j) jx). Under what conditions will the sequence ( ^(j) ) converge to the mode of ( jx)? (c) For the hierarchy X j N ( ; 1) ; j N ( ; 1) ; with ( ) = 1, show how to use the EM algorithm to calculate the posterior mode of ( jx).

CHAPTER 6

**The Metropolis-Hastings Algorithm
**

"What's changed, except what needed changing?" And there was something in that, Cadfael re ected. What was changed was the replacement of falsity by truth, and however hard the assimilation might be, it must be for the better. Truth can be costly, but in the end, it never falls short of value for the price paid. |Ellis Peter, The Confession of Brother Haluin| `Have you any thought', resumed Valentin, `of a tool with which it could be done?' `Speaking within modern probabilities, I really haven't,' said the doctor. |G.K. Chesterton, The Innocence of Father Brown|

6.1 Monte Carlo Methods based on Markov Chains

Chapter 3 has shown that it is not necessary to use a sample from the distribution f to approximate the integral

Z

h(x)f(x)dx ;

since importance sampling techniques can be used. This chapter develops this possibility in a di erent way and shows that it is possible to obtain a sample x1; ; xn distributed from f without simulating from f. The basic principle underlying the methods described in this chapter and the following chapters is to use an ergodic Markov chain with stationary distribution f: for an arbitrary starting value x(0), a chain (X (t) ) is generated from a transition kernel with stationary distribution f, which moreover ensures the convergence in distribution of (X (t) ) to f. (Given that the chain is ergodic, the starting value x(0) is, in principle, unimportant.) For instance, for a \large enough" T, X (T ) can thus be considered as distributed from f and the methods studied in this chapter produce a sample X (T ) ; X (T +1) ; : : :, which is generated from f, even though the X (T +t) 's are not independent. De nition 6.1.1 A Markov chain Monte Carlo (MCMC) method for the simulation of a distribution f is any method producing an ergodic Markov chain (X (t) ) whose stationary distribution is f.

as detailed in x6. As a corollary. where we have produced only one single general method of simulation.1? Despite its formal aspect. has fundamentally di erent methodological and historical motivations.1 f ? In comparison with the techniques developed in Chapter 3. there is no need for the generation of n independent chains 1 For example.11).3) guarantees the convergence of the empirical average T 1 X h(x(t)) (6. if there is no particular requirement on independence but if the incentive for the simulation study is rather on the properties of the distribution f. it is sometimes more e cient to use the pair (f. this de nition implies that the use of a Why should we resort to such a convoluted approach to simulate from chain (X (t) ) resulting from a Markov Chain Monte Carlo algorithm with stationary distribution f is similar to the use of an iid sample from f in the sense that the ergodic theorem (Theorem 4.214 THE METROPOLIS-HASTINGS ALGORITHM 6. This chapter covers the most general MCMC method. even when an accept-reject algorithm is available. in Bayesian inference. The call for Markov chains is nonetheless justi ed from at least two points of view: First. Chapter 9 considers the case of latent variable models.2). optimized. A sequence (X (t) ) produced by a Markov Chain Monte Carlo algorithm can thus be employed just as an iid sample.1.6.1.7.1. the ARS algorithm (x2.1. if possible. Thus. g) through a Markov chain. Therefore. as it relies on asymptotic convergence properties. while Chapter 7 specializes in the Gibbs sampler which. although a particular case of Metropolis{Hastings algorithm (see Theorem 7. this involved strategy may indeed sound suboptimal. in particular. the number of iterations required to obtain a good approximation of f is a priori important.1. which prohibit both analytic processing and numerical approximation in both classical (maximum likelihood) and Bayesian setups (see also Example 1. Second. How should we implement the principle brought forward by De nition 6. Markov Chain Monte Carlo methods achieve a \universal" dimension in the sense that they (and not only formally) validate the use of positive densities g for the simulation of arbitrary distributions of interest f.1 may seem purely formal at this stage. . which moreover only applies for log-concave densities. the (re-)discovery of Markov Chain Monte Carlo methods by the statisticians in the 1990's has undoubtedly induced considerable progress in simulation-based inference and. the call to Markov chains allows for a much greater generality than the methods presented in Chapter 2.4).3.1) T t=1 to the quantity IEf h(X)]. since it has opened access to the analysis of models which were too complex to be satisfactorily processed by previous schemes1 . namely the Metropolis{Hastings algorithm. even if the extension proposed in De nition 6.2.1. Chapter 5 has shown that stochastic optimization algorithms as those of Robbins-Monro or the SEM algorithm naturally produce Markov chain structures which should be generalized and. Moreover.

de ned with respect to the dominating measure for the model (see x6. Introduced by Metropolis. then? Given the principle stated in De nition 6. on methods used in statistical physics. handling this sequence is rather more arduous than in the iid case because of the dependence structure. for instance. For another. This gap of more than thirty years can be partially attributed to the lack of appropriate computing power since most of the examples now processed by Markov Chain Monte Carlo algorithms could not have been treated previously. Which transition should we use.2 The Metropolis{Hastings algorithm The Metropolis{Hastings algorithm starts with a conditional density q(yjx). these methods have later been generalized by Hastings (1970) and Peskun (1973) to statistical simulation. we do not include examples in this section. the determination of the \proper" length T is still under debate.2 ] THE METROPOLIS{HASTINGS ALGORITHM 215 (Xi(t) ) (i = 1. one can propose a in nite number of practical implementations based. It can only be implemented in practice when q( jx) is easy to simulate and is either explicitly available (up to a multiplicative constant independent from x). Rosenbluth.1. as we will see in Chapter 8. a single Markov chain is enough to ensure a proper approximation through estimates like (6. and some approaches to the convergence control of (6.1. : : :. Despite several later attempts in speci c settings (see for example Geman and Geman 1984.4 and in Chapter 8. 6. which is much more restrictive.6. but rather wait for x6. Teller and Teller (1953) in a setup of optimization on a discrete state space. as detailed in Chapter 7). Before illustrating the universality of Metropolis{Hastings algorithms and demonstrating their straightforward implementation.1) of IEf h(X)] for the functions h of interest (and sometimes even of the density f. Besag 1989).1 De nition 6. this approach induces a considerable waste of n(T ? 1) simulations out of nT . . that is such that q(xjy) = q(yjx).In other words.1) are given in x6.4 for a non-trivial example). we rst evacuate the (important) issue of their theoretical validity. or symmetric.2 6. which presents a typology of these types. 2 For one thing. Tanner and Wong 1987. while allowing for a wide choice of possible implementations.2.3. the starting point for an intensive use of these methods by the statistical community can be traced to the presentation of the Gibbs sampler by Gelfand and Smith (1990). Since the results presented below are valid for all types of Metropolis{Hastings algorithms. in sharp contrast with the Gibbs sampler given in Chapter 7. Obviously. The Metropolis{ Hastings algorithms proposed in this chapter have the de nite advantage of imposing minimal requirements on the study of the density f.1.3. n) where the \terminal" values Xi(T ) only are kept. Rosenbluth.1.

For instance. Yt q(yjx(t)). It is obviously necessary to impose minimal conditions on the conditional distribution q for f to be the limiting distribution of the chain (X (t) ) produced by A:24].2 The Metropolis{Hastings algorithm associated with the objective distribution f and the conditional distribution q produces a Markov chain (X (t) ) through the following transition: Given Algorithm A. A:24] The distribution q will be called the instrumental (or proposal ) distribu- This algorithm always accepts values yt such that the \likelihood ratio" f(yt )=q(yt jx(t)) is increased compared with the previous value.2. ) in the average (6. when given a pair (f. while this is an impossible feature in continuous iid settings.1). and are therefore rejected by the algorithm. Generate 2. The yt 's generated by the algorithm A:24] are thus associated with weights of the form mt =T (mt = 0. Obviously. it involves repeated occurrences of the same value. Like the accept-reject method. For one thing.1. Yt). similar to stochastic optimization methods (see x5.2 and it is possible to use the algorithm A:24] as an alternative to an accept-reject algorithm. if the chain starts with a value x(0) such that f(x(0) ) > 0. the probability (x(t). However. 1. y) = min f(x) q(yjjx) . It is only in the symmetric case that the acceptance is driven by the ratio f(yt )=f(x(t) ). An important feature of the algorithm A:24] is that it may also accept values yt such that the ratio is decreasing. is supposed to be connected .3. There are similarities between A:24] and the accept-reject methods of x2. as discussed in x6. yt ) is only de ned when f(x(t) ) > 0.6.2).1. with probability with probability 1. it follows that f(x(t) ) > 0 for every t 2 IN since the values of yt such that f(yt ) = 0 lead to (x(t) . and the comparison with importance sampling is somehow more relevant. Nonetheless. Both approaches are compared in x6.24 Metropolis{Hastings algorithm x(t). a sample produced by A:24] obviously di ers from an iid sample. E .4. f(y) q(x y) (x. (This assumption is often omitted in the literature but the lack . Take Yt X (t+1) = x(t) where (x(t) . the support of f. since to reject Yt leads to repeat x(t) at time t+1. 1 ? (x(t) .2.216 THE METROPOLIS-HASTINGS ALGORITHM 6. Yt). g). depending on how many times the subsequent values have been rejected. 1 : tion. the Metropolis{Hastings algorithm only depends on the ratios f(yt )=f(x(t) ) and q(x(t)jyt)=q(yt jx(t)) and is therefore independent of normalizing constants. yt) = 0.

in the second and fourth integrals on the right hand side. for x(0) 2 A. it is necessary to proceed on one connected component at a time and show that the di erent connected components of E are linked by the kernel of A:24]. Proof. A)f(x)dx = I A (x) 1 ? f(x) q(x0jx) q(x0jx) f(x)dxdx0 for D = f(x.2 ] THE METROPOLIS{HASTINGS ALGORITHM 217 of connectedness of E can deeply invalidate the Metropolis{Hastings algorithm. Thus. x0) = (x. x0)) x (x0) . This also transforms the set D into . f(x0 ) q(xjx0) f(x) q(x0 jx)g. f is a stationary distribution of the chain (X (t) ) produced by A:24]. we make the change of variables (x0. for every measurable set A. if there exists A E such that Z the algorithm A:24] does not have f as a limiting distribution since.2. Z K(x. See Hobert. x0))q(x0 jx)dx0 I A (x)f(x)dx ZZ 0) 0 = I A (x0 ) f(x ) q(xj0 xx) q(x0jx)f(x)dxdx0 f(x) q(x j Z DZ + I A (x0) q(x0 jx) f(x)dxdx0 Dc ZZ f(x0 ) q(xjx0) ZZ I A (x0)f(x0 )q(xjx0)dxdx0 I A (x)f(x)q(x0 jx)dxdx0 I A (x0) f(x)q(x0 jx)dxdx0 I A (x) f(x0 ) q(xjx0) dxdx0 Z K(x. 8x 2 E . x0)q(x0 jx) + (1 ? (x.6. x0)q(x0 jx)f(x)dxdx0 ZZ + (1 ? (x.) If the support of E is truncated by q. x0). The transition kernel associated with A:24] can be written K(x. we still have = the following result: Theorem 6. x) to (x. x0).1 For every conditional distribution q. Robert and Goutis 1997 for a treatment of the non-connected Gibbs sampler. where x denotes the Dirac mass in x. A f(x)dx > 0 and Z A q(yjx)dy = 0 . In such a case.6. that is. + D ZZ I A (x0) (x. A)f(x)dx = + + Z Z ZD Z ZD D ? ZD c where. the chain (X (t) ) never visits A. Therefore. whose support includes E . In full generality.

y) 1 (see also Winkler 1995). ergodicity follows from Theorem 4.218 THE METROPOLIS-HASTINGS ALGORITHM 6. since f is stationary. 2 The stationarity of f is therefore established for almost any conditional distribution q. This is impossible. Metropolis{Hastings algorithms are only a special case of a more general class of algorithms whose transition is associated with the acceptance probability s(x. As shown by Hastings (1970). y) = 1 is also known as the Boltzman algorithm and used in simulation for particle Physics. Tierney (1994) has indeed established the following result: Theorem 6. given the f-irreducibility of (X (t) ). 1 + f(y)q(xjy) where s is an arbitrary positive symmetric function such that %(x.4. When (X (t) ) is aperiodic. A)f(x)dx = ZZ I A (x0 ) f(x0 ) q(xjx0)dxdx0 = Z A f(x0 )dx0 .2. it is Harris recurrent. Proof.2 Dc . in the discrete case.2 If the chain (X (t) ) is f -irreducible. 0 . and vice versa. a fact which indicates how universal Metropolis{Hastings algorithms are.4). it is positive recurrent.2. The chain (X (t) ) is therefore positive recurrent. Tierney (1995) and Mira. Proof.3.5) for every value x(0). 6. The particular case s(x.2. y) %(x. y) = (6.2. x1) q(x1 jx0) h(x1)dx1 + (1 ? (x0 )) h(x0) . it satis es h(x0 ) = IE h(X (1) )jx0] = IE h(X (t) )jx0] and therefore h is f-almost everywhere constant and equal to IEf h(X)]. then the space X can be written as Ei with P(X (t) 2 Ei ) converging to 0 (see De nition 4. establishing the stationarity of f. Suppose (X (t) ) is not recurrent. although Peskun (1973) has shown that. Geyer and Tierney (1998) propose extensions to the continuous case.2 Convergence Properties In order to establish the ergodicity of (X (t) ) and therefore to validate A:24] as a Markov Chain Monte Carlo algorithm. the performances of this algorithm are always suboptimal when compared with the Metropolis{ Hastings algorithm (see Problem 6. This result can be established by using the fact that the only bounded harmonic functions are constant. If h is an harmonic function.6.3 If (X (t) ) is f -irreducible. we have Z K(x. 2 Theorem 6. Since Z (1) )jx ] = IE h(X (x0 . it is ergodic. we need to prove both the f -irreducibility and the aperiodicity of (X (t) ). If (X (t) ) is in addition aperiodic.3.1) f(x)q(yjx) . Therefore.

the essential supremum ess sup h(x) = inf fw.1. 2.6) but x6. geometric convergence is almost never guaranteed. is rather restrictive besides the E nite case. 6. For instance. if is not bounded from below on a set of measure 1. Take bst = 0 if xs 6= xt and. It is indeed enough that g be non empty in a neighborhood of 0 to ensure the ergodicity of A:24] (see x6.12. This result is important as it characterizes Metropolis{Hastings algorithms which are weakly convergent (see the extreme case of Example 8.1 If the marginal probability of acceptance satis es essf sup (1 ? (x)) = 1.2. (h(x) > w) = 0g .6. De ning. however. since the function is almost always intractable.6.18 and 7.3) of Theorem 6. it cannot be used as a criterion for geometric convergence.6 ] NOTES 257 where c(b) denotes the number of clusters. as stated by Theorem 4.6. . 0 otherwise. In the particular case when E is a small set (see Chapter 4).) G. for every measure and every -measurable function h. it is quite di cult to obtain a convergence stronger than the simple ergodic convergence n of (6. however. It is in fact equivalent to Doeblin's condition. Chapter 7 exhibits continuous examples of Gibbs samplers when uniform ergodicity holds (see Examples 7.4 for the irreducibility of the Metropolis{Hastings Markov chain is particularly well adapted to random walks. it is impossible to establish geometric convergence without a restriction to the discrete case or without considering particular transition densities. the algorithm A:24] is not geometrically ergodic. since Roberts and Tweedie (1996) have come up with chains which are not geometrically ergodic. On the other hand.2. (b) Show that the Swendson-Wang algorithm 1.1. This condition. they have indeed established the following result: Theorem 6. for xs = xt . with transition densities q(yjx) = g(y ? x). choose a color at random on leads to simulations from (Note: This algorithm is acknowledged as accelerating convergence in image processing.6 Notes 6. while being a natural choice for the instrumental distribution q. For every cluster. Therefore.2 has shown that in the particular case of random walks.1 Geometric Convergence of Metropolis-Hastings algorithms The su cient condition (6.1.6.3.1) or than the (total variation) convergence of kPx(0) ? f kTV without introducing additional conditions on f and q. a geometric speed of convergence cannot be guaranteed for A:24]. the number of sites connected by active bonds.e. n bst = 1 with probability 1 ? qst . i. Roberts and Polson (1995) note that the chain (X (t)) is uniformly ergodic.3.2.2 for a detailed study of these methods).6.8).

The fact that the simulated annealing algorithm may give a value of E (X (t+1) ) larger than E (x(t) ) is a very positive feature of the method.1). Note. (1996) have shown that the optimal choice of is 2:4.6. For a given value T > 0. that is " 1+2 X k>0 ? cov X (t) . In the particular case when f is the density of the N (0. is that the acceptance probability converges to 0:234. t As noted in x5. This heuristic rule is based on the asymptotic behavior of an e ciency criterion equal to the ratio of the variance of an estimator based on an iid sample and the variance of the estimator (3. equal to 0:44 for = 2:4. it produces a Markov chain (X (t)) on X by the following transition: 1.6. i. (1996). 1) distribution and when g is the density of a Gaussian random walk with variance . An equivalent version of this emp pirical rule is to take the scale factor in g equal to 2:38= d . Gilks and Roberts (1996) recommend the use of instrumental distributions such that their acceptance rate is close to 1=4 for models of high dimension and is equal to 1=2 for the models of dimension 1 or 2. X (t+k) #?1 in the case h(x) = x.2 A Reinterpretation of Simulated Annealing Consider a function E de ned on a nite set X with such a large cardinal that a minimization of E based on the comparison of the values of E (X ) is not feasible. since it allows for escapes from the attraction zone of local minima of E when T is large enough. does not cover the extension to the case when T varies with t and converges to 0 \slowly enough" (typically in log t). The chain (X (t) ) is therefore associated with the stationary distribution f (x) / exp(?E (x)=T ) provided that the matrix of the q(ijj )'s generates an irreducible chain. with probability exp ?fE(x(t) ) ? E( )g=T ^ 1. 6.2. where d is the . presented in Chapter 4. given the symmetry in the conditional distribution. based on an approximation of x(t) by a Langevin di usion process (see x6. Generate t according to q( jx(t) ). 2.6.3) is based on a conditional density q on X such that q(ijj ) = q(j ji) for every (i.1. approximately 1=4. j ) 2 X 2 . The simulated annealing technique (see x5.258 THE METROPOLIS-HASTINGS ALGORITHM 6. that the theory of homogeneous Markov chains. Gelman et al.6 6. Comparing A:24] and the simulated annealing algorithm. the simulated value t is automatically accepted when E ( t ) E (x(t) ).5) when the dimension of the problem goes to in nity.3.e. The corresponding acceptance rate is = 2 arctan 2 . A second result by Gelman et al. the later appears as a particular case of Metropolis{Hastings algorithm. Take X (t+1) = x(t) t otherwise.3.3 Reference Acceptance Rates Gelman.2. with moreover a dissymmetry in the e ciency in favor of large values of . however.

. who had been sent for him in some haste. He got to his feet with promptitude. since they only require a limited amount of information about the distribution to simulate. the Gibbs sampling algorithm has a number of distinct features: (i) The acceptance rate of the Gibbs sampler is uniformly equal to 1. continuing to look down the nave. The Innocence of Father Brown| 7. as an example of a generic algorithm. It was so simple. . formally.1 General Principles The previous chapter has developed simulation techniques which could be called \generic".3. for he knew no small matter would have brought Gibbs in such a place at all. so obvious he just started to laugh.K.1.4. 7. For instance. when suddenly the solution to the problem just seemed to present itself. \It's just a matter of perspective.1. This is because the choice of an instrumental distribution is essentially reduced to a choice between a nite number of possibilities. my dear boy. ARMS (x6.CHAPTER 7 The Gibbs Sampler He sat. see Theorem 7. a special case of Metropolis{ Hastings algorithm (or rather a combination of Metropolis{Hastings algorithms on di erent components. Chesterton. In contrast. aims at reproducing f in an automatic manner.4. Doherty. the echoes pealing around the deserted church.1). in particular through the calibration of the acceptance rate (see x6. Metropolis{Hastings algorithms achieve higher levels of e ciency when they take into account the speci cs of the distribution f. Therefore. every simulated value is accepted and the results of x6. Satan in St Mary's| In this place he was found by Gibbs." |P. \Just a matter of perspective.10 below)." he used to boom out.1 De nition . |G.1 on the optimal acceptance rates are not valid in this setting. Although the Gibbs sampler is. the properties and performance of the Gibbs sampling method presented in this chapter are very closely tied to the distribution f.] He remembered the voice of his old `Dominus' Father Benedict.3). This also means that the convergence assessment for this algorithm must be treated differently than for Metropolis{Hastings techniques. telling him that there was a solution to every problem..C. However.

fp are called the full conditionals. : : :. : : :. generate (7. Xp). However. as in x6. fp . p A:31] p. For example. xi?1.1 (ii) The use of the Gibbs sampler implies limitations on the choice of instrumental distributions and requires a prior knowledge of some analytical or probabilistic properties of f. The densities f1 .1) Yt fY jX ( jxt?1) Xt fX jY ( jyt) where fY jX and fX jY are the conditional distributions. 2 ( X2t+1) f2 (x2 jx(t+1). we will rst de ne the speci c features of this algorithm. x(t). xpt)). and it is a particular feature of the Gibbs sampler that these are the densities used for simulation. the chain (Xt ) chain has transition kernel K(x. and for t = 1. x(t)). (iii) The Gibbs sampler is. These di erent points will be made clearer after a discussion of the properties of the Gibbs sampler. 2. : : :.3. by construction. y).4. Suppose that for some p > 1 the random variable X 2 X can be written as X = (X1 . which is usually an advantage. So. x(t?1 ).or multidimensional. The sequence (Xt . xi?1. : : :. the construction is still at least two-dimensional. xp fi (xijx1. is a Markov chain. : : :. Yt). x2. : : :. xi+1. ( ( X1t+1) f1 (x1 jx(t). Even though some components of the simulated vector may be arti cial for the problem of interest. as is each sequence (Xt ) and (Yt ) individually. : : :.31 {The Gibbs sampler{ Given x(t) = (x1t). . xpt)). multidimensional. all of the simulations may be univariate. Moreover.1.262 THE GIBBS SAMPLER 7.1 {Bivariate Gibbs sampler{ Let the random variables X and Y have joint density f(x. : : :. : : :. xp ): The associated Gibbs sampling algorithm (or Gibbs sampler) is given by the following transition from X (t) to X (t+1) : ( ( Algorithm A. where the Xi's are either uni. or unnecessary for the required inference.7. Example 7. generate 1. that is Xi jx1.1. : : :. p 1 3 ::: ( ( +1) Xpt+1) fp (xp jx1t+1). 2. for obvious reasons of lack of irreducibility of the resulting chain. and generate a sequence of observations according to the following: Set X0 = x0. : : :. x2. : : :. x ) = Z fX jY (x jy)fY jX (yjx)dy. xi+1. suppose that we can simulate from the corresponding conditional densities f1 . even in a high dimensional problem. (iv) The Gibbs sampler does not apply to problems where the number of parameters varies.

the corresponding density + is f(y1 . Y ) N2 0.1. generate (7. 1 ? 2 ): Obviously. 1 +1 23 2 12 y1 Z expf?y2 ? 12y1y2g ?y (y1 ) / 1 + y + y dy2 e . the Gibbs sampler is Given yt . si 2 f?1.1.2). (X. this is a formal example since the bivariate normal density can be directly simulated by the Box-Muller algorithm (see Example 2. y2. j j s ?J i i X ss (i.2 {Auto-exponential model{ The auto-exponential model of Besag (1974) has been found useful in some aspects of spatial modelling.1 ] GENERAL PRINCIPLES 263 with invariant distribution fX ( ). the other conditionals. y3 ) dy3 (y / expf?+ 1 +y y2++ 12y12 )g . The full conditional densities are exponential. and the marginal distributions have forms such as y2 jy1 y1 X Z +1 0 which cannot be simulated easily.j )2N i j .2. the full conditional distribution is P expf?Hsi ? Jsi j :(i. y2.7. 1 ? 2 ) Yt+1 j xt N ( xt+1 . . 0 23 2 12 1 1 k Example 7. and so are very easy to simulate from. f(s) / exp ?H X and where N denotes the neighborhood relation for the network.j )2N sj g f(si jsj 6=i ) = expf?H ? J P s g + expfH + J P s g j j Pj j expf?(H + j sj )(si + 1)g = 1 + expf?2(H + P s )g .7. For the particular case when y 2 IR3 . y3 ) / expf?(y1 + y2 + y3 + 12 y1 y2 + 23 y2 y3 + 31y3 y1 )g . 1 1 (7. 1g. where f(y1 .1.3) Xt+1 j yt N ( yt . with known ij > 0.1.2) .2. yi jyj 6=i E xp 1 + j 6=i ij yj . For the special case of the bivariate normal density.5. k Example 7.3 {Ising model{ For the Ising model of Example 5. In contrast.

( ( (y1t) . : : :. : : :. : : :.1.12. yp g2 (y2 jy1 . the naive simulationof a normal N (0. the devising and the optimization of the accept-reject algorithm constructed in Example 2.2. Y2jy1 . and the following Gibbs algorithm is implemented. yp g1 (y1 jy2 . De nition 7. write y = (x. : : :. It is therefore particularly easy to implement A:32] for these conditional distributions by updating successively each node of the network. ( ( ( Y2(t+1) g2 (y2 jy1t+1) . : : :. ::: ( ( +1) Yp(t+1) gp (yp jy1t+1) . y3t). please!!! Example 7. 1) till the outcome is above is suboptimal for large values of . For p > 1.12 can be overly costly if the algorithm is only to be used a few times. yp?1 gp (yp jy1 .2. a density g that satis es 7. ypt?1 ). : : :.12. y3. 2.264 THE GIBBS SAMPLER 7. y3. p.32 {Completion Gibbs sampler{ Given 1. it is easy to generalize the Gibbs sampling algorithm by a \demarginalization" or completion construction.4 Given a probability density f. yp?1 ): Y (t) to Y (t+1).2 Completion Z Algorithm A. The density g is chosen so that the full conditionals of g are easy to simulate from. ypt) ). Yp jy1 . : : :. yp ). z) / I x I z expf?(x? ) =2 g : 2 2 2 2 . simulate ( ( Y1(t+1) g1 (y1 jy2t) . z) and denote the conditional densities of g(y) = g(y1 . : : :. is called a completion of f. : : :. consider a truncated normal distribution. k Following the mixture method proposed in x2. yp ). : : :. z) dz = f(x) A:32] We need a simple example here{Dave Win eld? No baseball. Z g(x.1. for instance.2.1 which is a logistic distribution on (si + 1)=2. : : :. f(x) / e?(x? ) =2 I x : As mentioned in Example 2.1. : : :. ypt) ).2.5 {Truncated Normal Distribution{ As in Example 2. g(x. ypt) ). However. yp ) by Y1 jy2.7. An alternative is to use completion. as.1.

there indeed exists a wide choice among the in nite number of densities for which f is a marginal density. 1994).3. In principle.3. g2. second. Student's t distribution can be generated as a mixture of a normal distribution by a 2 distribution. see A:33]. in general. . rst because there is no more results on this topic than on the choice of an optimal density g in the Metropolis{Hastings algorithm and.) The corresponding implementation of A:32] is then X (t) jz (t?1) U ( .2.1. 2 Simulate p f( j 0 ) / Z1 0 e? =2 e? 1+( ? 2 0 )2 ] =2 ?1 d . : : :.7. treated in Chapter 9. ?2 2 log(z (t?1) )]).6 {Cauchy-normal posterior distribution{ As shown in Chapter 2 (x2.1 ] GENERAL PRINCIPLES 265 (This is a special case of slice sampling. gp to converge faster to the maximum of the likelihood function.1).7. which is described in x5. The density f( j 0 ) can be written as the marginal density 1. Consider for instance the density ? =2 f( j 0 ) / 1 + e ? )2 ] : ( 0 This is the posterior distribution of the location parameter in a Cauchy distribution (see Example 8. a natural completion of f in g and of x in y. and even more with recent versions of EM such as ECM and MCEM (see Meng and Rubin 1991.1).2) which also appears in the estimation of the parameter of interest in a linear calibration model (see Example 1. expf?(x(t) ? )2=2 2 g]). In cases when such completions seem necessary (for instance. 1992.3.6. 2. or Liu and Rubin.5. We will not discuss this choice in terms of optimality. when every conditional distribution associated with f is not explicit). Missing data models. which use maximizations of conditional parts of the likelihood like g1.1).) Note the similarity of this approach with the EM algorithm for maximizing a missing data likelihood.3. the Gibbs sampler by no means requires that the completion of f into g and of x in y = (x. z) should be related to the problem of interest. There are therefore settings such that the vector z has no meaning from a statistical point of view and is only a useful device. namely the introduction of data augmentation by Tanner and Wong (1987) (see Note x7. because there exists. p Note that the initial value of z must be chosen so that ?2 2 log(z (0) ) > .1. This kind of decomposition is also useful when the expression 1+( ? 0)2]? appears in a more complex distribution. Z (t) jx(t) U ( 0. Example 7. k The completion of a density f into a density g such that f is the marginal density of g corresponds to one of the rst (historical) appearances of the Gibbs sampling in Statistics. provide a series of examples of natural completion (see also x5.

^ `i = max `i : p i `i = ri log(pi ) + (ni ? pi ) log(1 ? pi ). probit and log-log models.) . : : : . that is. Construct the associated Gibbs sampler and compare with the previous results.7. 2 ). 105 2 ). k?1 N (2 k?1 ? k?2 . the logit. i pi = (exp( + xi)). Yij 7. For each of these models.12. b). while 1 . 7. 4. i + j. pi = 1 ? exp(? exp( + xi )). (1992) impose the constraints 1 > : : : > 4 and 1 < : : : < 3 > : : : > 5 . Gelfand et al. nk 2 ): Determine the value of k and compare the associated Gibbs sampler with the previous implementation.53 y(Breslow and Clayton 1993) In the modelling of breast cancer cases yi D=2 where 1. pi ) of which are killed. ?2 with 2 = 2 = 5 and a = 0.4) is k j j . 2 ): k (a) Derive the Gibbs sampler associated with almost at hyperpriors on the parameters j . the posterior expectation of 7. G a(a. 11) (7. N (0. construct a Gibbs sampler and compute the expected posterior deviance.2. 2 ). : : : . log( i ) = log(di ) + xi + di . : : : . (b) Breslow and Clayton (1993) consider a dependent alternative where (k = 3. 2 ).7. : : : .5 ] PROBLEMS 319 according to age xi and year of birth di . N (0. N (0.5.54 y(Dobson 1983) The e ect of a pesticide is tested against its concentration xi on ni beetles. (Hint: Use the optimal truncated normal accept-reject algorithm of Example 2. an exchangeable solution is Yi P ( i ). k . (a) Give the Gibbs sampler for this model. Three generalized linear models are in competition: exp( ) pi = 1 + exp(+ +xix ) . that is.4) k j 1 . b = 1. . j 6= k N ( k . 2 N (0.55 y(Spiegelhalter et al. (c) An alternative representation of (7. j = ij = i j N ( ij . 1996) Consider a standard Anova model (i = 1.5. 2 ). 5) n n X^ X ! i=1 `i ? i=1 `i . Ri B(ni .

6 ] NOTES 321 7. s >=< r.7. is de ned by < Tr. and Polson (1996) who work with a functional representation of the transition operator of the chain. Schervish and Carlin (1992) de ne the measure as the measure with density 1=g with respect to the Lebesgue measure. T s > 11 This section presents some results on the analysis of Gibbs sampling algorithms from a functional analysis point of view. It may be skipped on a rst reading since it will not be used in the book and remains at a rather theoretical level.6. r satis es jcj jr(y)j dy = = = Z Z jTr(y)j dy Z Z r(y0 ) K (y0.1) K 2 (y. Proof. where g is the density of the stationary distribution of the chain.1.2 Geometric Convergence While11 the geometric ergodicity conditions of Chapter 6 do not apply for Gibbs sampling algorithms. y) dy0 dy ZZ jr(y0 )j K (y0.7.6) with stationary measure g. y0 ) is the transition kernel (7. s >= r(y) s(y) (dy) and de ne the operator T on L2 ( ) by (Tr)(y) = Z r(y0 ) K (y0 . where K (y. which ensures both the compacity of T and the geometric convergence of the chain (Y (t) ). y) g(y0 ) dy0 = g(y). The other eigenvalues of T are characterized by the following result: Lemma 7. Consider an eigenvector r associated with the eigenvalue c. although these are no so easy to implement in practice. g is an eigenvector associated with the eigenvalue 1. y0 ) dy dy < 1. such that Tr = cr. They then de ne a scalar product on L2 ( ) Z < r. The adjoint operator associated with T . Wong and Kong (1995). Since Z K (y0 . y) dy0 dy Z jr(y0 )j dy0 and therefore jcj 1. some su cient conditions can also be found in the literature.e. Liu. In this setting.6.6.1 The eigenvalues of T are all within the unit disk of C. y) g(y0 ) (dy0 ). The approach presented below is based on results by Schervish and Carlin (1992). . i. T . 2 The main requirement in Schervish and Carlin (1992) is the Hilbert-Schmidt condition ZZ (7.

beta) lambda i] theta i] * t i] x i] dpois(lambda i]) f g alpha beta dexp(1. 1995a).6. it has been designed to take advantage of the possibilities of the Gibbs sampler in Bayesian analysis.c) at the MRC Biostatistics Unit in Cambridge. Thomas.6 a review of ner convergence properties. y0 ) is not explicit. which represents a normal modelling with mean 0 and precision (inverse variance) 0:0001.4). A major restriction of this software is the use of the conjugate priors or at least log-concave distributions for the Gibbs sampler to apply..6.1.1) is moreover particularly di cult to check when K (y. like dnorm(0. which also allows for a large range of transforms. out. England.3 The BUGS software The acronym BUGS stands for Bayesian inference using Gibbs sampling. However. .7. BUGS includes a language which is C or S-plus like and involves declarations about the model. The Hilbert-Schmidt condition (7. p.1.6. the batch size being also open. We still conclude with the warning that the practical consequences of this theoretical evaluation seem negligible. we also consider that the eigenvalues of the operators T and F and the convergence of norms kgt ? gkTV to 0 are only marginally relevant in the study of the convergence of the average T 1 X h(y ) (7. At last.18. where the weight wk depend on h and F and the k are the (decreasing) eigenvalues of F (with 1 = 1).9). even though there exists a theoretical connection between the 2 asymptotic variance h of (7. the model and priors are de ned by for (i in 1:N) theta i] dgamma(alpha. the data and the prior speci cations.6) and the spectrum ( k ) of F through 2 X wk 1 + k 1 + 2 h= 1? k 1? 2 k 2 T t=1 t (see Besag and Green 1992 and Geyer 1992).0001).6) to IE h(Y )]. The output of BUGS is a table of the simulated values of the parameters after an open number of warmup iterations.6.0.1. In fact. For instance. In addition.326 THE GIBBS SAMPLER 7. Most standard distributions are recognized by BUGS (21 are listed in Spiegelhalter et al. more complex distributions can be handled by discretization of their support and assessment of the sensitivity to the discretization step. 1995b. improper priors are not accepted and must be replaced by proper priors with small precision. Best and Gilks (1995a. 7.0) (see Spiegelhalter et al. setups where eigenvalues of these operators are available almost always correspond to case where Gibbs sampling is not necessary (see. e.b. This software has been elaborated by Spiegelhalter. Example 7.0) dgamma(0. stat. for single or multiple levels in the prior modelling.6. As shown by its name.g. BUGS also recognizes a series of commands like compile. for the benchmark nuclear pump failures dataset of Example 7. data.

non-parametric smoothing. that is Spring 1998!.56.6 ] NOTES 327 The BUGS manual (Spiegelhalter et al.cam. the authors have written an extended and most helpful example manual (Spiegelhalter et al.6. c). including meta-analysis. latent variable.) The BUGS software is also compatible with the convergence diagnosis software CODA presented in Note x8.uk/bugs for a wide variety of platforms.ac.4. model selection and geometric modelling.7. (Some of these models are presented in Problems 7. 1995b. survival analysis.mrc bsu. 1995a) is quite informative and wellwritten.45{7.14 In addition. which exhibits the ability of BUGS to deal with an amazing number of models. . the BUGS software is available as a freeware on the Web site http://www.7. 14 At the present time.

The Blessing Way| 8. \must have signi cance. And until I know what that signi cance is.1 When and why to stop? The two previous chapters have laid the theoretical foundations of MCMC algorithms and showed that. since they do not directly induce methods to control the chain produced by an algorithm (in the sense of a stopping rule which guarantees that the number of iterations is su cient). they are nonetheless insu cient from the point of view of the implementation of MCMC methods. The Heretic's Apprentice| Leaphorn never counted on luck. The goal of this chapter is therefore to present. Instead. he expected order{the natural sequence of behavior. In other words. the chains produced by these algorithms are ergodic. by describing a sequence of non comparable techniques with widely varying degrees of theoretical justi cation. He counted on that and on his own ability to sort out the chaos of observed facts and nd in them this natural order. While such developments are obviously necessary. there are three (increasingly stringent) types of convergence for which control is necessary: (i) convergence of the chain (t) to the stationary distribution f (or stationarization).CHAPTER 8 Diagnosing Convergence \Everything that is not what it seems and not what it reasonably should be". under fairly general conditions. the human behaving in the way it was natural for him to behave. the cause producing the natural e ect. said Cadfael rmly. |Tony Hillerman. as well as Robert (1995) (The style of this chapter re ects the more recent and exploratory nature of these methods. the general convergence results do not tell us when to stop the MCMC algorithm. sometimes in an allusive mode." |Ellis Peter. I cannot be content. . the numerous methods of control proposed in the literature. as in the reviews of Cowles and Carlin (1994) and Brooks and Roberts (1995). or even geometrically ergodic.) From a general point of view.

(In cases where this assumption is unrealistic. and the storage of the last simulation i(T ) in each chain.330 DIAGNOSING CONVERGENCE 8.) On a general basis. f is only the limiting distribution of (t) . (Note that.2. discussed in x8. on the opposite. Indeed. with lengthy stays in each of these regions (e. slow exploration of the support of f and strong correlations between the (t) 's are rather the issues at stake. In fact. therefore to act as if the chain is already in its stationary regime from the start. the original implementation of the Gibbs sampler was based on the generation of n independent initial values i(0) (i = 1. even when (0) f. notwithstanding the starting distribution. it seems to us that this approach (i) to convergence issues is not particularly fruitful. This may not be the case for high dimensional setups or complex structures where the algorithm is initialized at random. (i) appears as a minimum requirement on a algorithm supposed to approximate simulation from f! For instance. stationarity is therefore only achieved asymptotically and the i(T ) 's are distributed from f T .8. depending on the transition kernel chosen for the algorithm.2. 0 1 We consider a standard statistical setup where the support of f is approximately known. there are methods like the exact simulation of Propp and Wilson (1996). but we do think that convergence to f per se is not the major issue for most MCMC algorithms. and a stationarity test may be instrumental in detecting such di culties. First. in practice.1. .1) T t=1 to IEf h( )] for an arbitrary function h. the question of convergence to the limiting distribution is not really relevant. 0. since. nt) ) to iid-ness.g. where stationarity can be rigorously achieved from the start. as we will see in x8. if 0 is the (initial) distribution of (0) . More precisely.1 (ii) convergence of the empirical average T 1 X h( (t) ) (8. n). . Strictly speaking. This is not to say that stationarity should not be tested at all. it is often possible1 to consider the initial value (0) as distributed from the distribution f. Indeed. : : :. we only consider a single realization (or path) of the chain ( (t) ). since it implies discarding most of the generated variables with little justi cation with regards to points (i) and (ii). from a theoretical point of view. ( ( (iii) convergence of a sample ( 1t) .5. the exploration of the complexity of f by the chain ( (t) ) can be more or less lengthy. the chain may be slow to explore the di erent regions of the support of f. in the sense that the chain truly produced by the algorithm often behaves like a chain initialized from f.6. this approach thus requires a corresponding stopping rule for the correct determination of T (see for instance Tanner and Wong 1987). the modes of the distribution f). The second type (ii) of convergence is deemed to be the most relevant in the implementation of MCMC algorithms. this method induces a waste of resources. If.) This way of evacuating the rst type of control may appear rather cavalier.

g. for convergence assessment.1).1.2. in some settings. =T t=1 var( 1 ) var( k ): k 0 . subsampling may be bene cial (see. This technique.2).2. subsampling is justi ed. = Tk k T 1 1 the variance of satis es t=1 `=1 1 Proof. Brooks and Roberts (1995) relate this convergence to the mixing speed of the chain. .3). the motivation for subsampling is obvious.1. (t) )|which also justi es RaoBlackwellization (see x7.4. the above covariance oscillates with t.. 1992). Robert. the goal is to produce variables i which are (quasi-)independent. if the chain ( (t) ) satis es an interleaving property (see x7.1 describes how Raftery and Lewis (1992a. an alternative is to use sub-sampling (or batch sampling) to reduce correlation between the successive points of the Markov chain. When the covariance covf ( (0) .8. subsamples the chain ( (t) ) with a batch size k. Ryden and Titterington 1998). it is always preferable to use the whole sample for the approximation of IEf h( )]. in the informal sense of a strong (or weak) dependence on initial conditions and of a slow (or fast) exploration of the support of f (see also Asmussen et al. Note at this stage that sub-sampling necessarily leads to losses in e ciency with regards to the second convergence goal.b) estimate this batch size k. (t) ) is decreasing monotonically with t (see x7. In particular. For every k > 1. that is i = 0. i k T 1 X h( (tk?i) ). which complicates the choice of k.2.8. Section x8. 1. Lemma 8. While the solution based on parallel chains mentioned above is not satisfactory. the relevant issue at this stage is to determine a minimal value for T which validates the approximation of IEf h( )] by (8. De ne k . all the modes).k? 1 : .1. as shown by MacEachern and Berliner (1994). if ( (t) ) is a Markov chain with stationary distribution f and if Tk T 1 X h( (t) ) . which is customarily used in numerical simulation (see for instance Schmeiser 1989). checking for the monotone decrease of covf ( (0) . e. Nonetheless.1. The third type of convergence (iii) takes into account independence requirements for the simulated values. considering only the values (t) = (kt) .3)|is not always possible and. However.1 ] WHEN AND WHY TO STOP? 331 The purpose of the control is therefore to determine whether the chain has indeed exhibited all the facets of f (for instance. A formal version of convergence control in this setup is the convergence assessment of x8. Rather than approximating integrals like IEf h( )]. = 1 X h( (k`) ). In fact. k ?1 as the drifted versions of k = k .1 { Consider h 2 L2 (f). While the ergodic theorem guarantees the convergence of this average from a theoretical point of view.

like Gibbs sampling used in highly non linear setups. 2 In the remainder of the chapter. as in renewal theory (see x8.8. Moreover. namely (a) that the slower chain governs convergence and (b) that the choice of the initial distribution is quintessential to guarantee that the di erent chains are welldispersed. from the Cauchy{Schwarz inequality.3). shape of high density regions. we also distinguish between the meth(t ods involving the simulation in parallel of M independent chains ( m) ) (1 m M) and those based on a single `on-line' chain. In fact. even though they seem to propose a evaluation of convergence which is more robust that for single chain methods. usually favor single chains. Besides distinguishing between convergence to stationarity (x8. k )=k2 1 var( k var( k )=k2 i6=j = var( k ) . Liu and Rubin (1995) and Johnson (1996) can be similarly criticized. The motivation of the former is intuitively sound: by simulating several chains. variability and dependence on the initial values are reduced and it should be easier to control convergence to the stationary distribution by comparing the estimations of quantities of interest on the di erent chains.2) and convergence of the average (x8. Liu. For instance. an initial distribution which is too concentrated around a local mode of f does not contribute signi cantly more than a single chain to the exploration of the speci cities of f. The dangers of a naive implementation of this principle should be obvious. we consider independence issues only in cases where they have bearing on the control of the chain. which will presumably stay in the neighborhood of the starting point with higher probability. The elaborate developments of Gelman and Rubin (1992). In other words.). An 1 )=k + i6=j X . etc.2.332 1 DIAGNOSING CONVERGENCE 8.1 can then be written under the form X 1 k?1 i : =k 1 k i=0 Therefore X ! 1 k?1 i var( 1 ) = var k k i=0 X i i j = var( k )=k + cov( k . slow algorithms. good performances of these parallel methods require a su cient a priori knowledge on the distribution f in order to construct an initial distribu(0) tion on m which takes into account the speci cities of f (modes. Geyer (1992) points out that this robustness is illusory from several points of view. in the sense that a unique chain with MT observations and a slow rate of mixing is more likely to get closer to the stationary distribution than M chains of size T.3).

invalidate an arbitrary indicator." . For instance. 8. let us agree with many authors that it is somehow illusory to aim at controlling the ow of a Markov chain and assessing its convergence behavior from a single realization of this chain. Best.) The setting is certainly no better for the Markov chain methods and they should be used with appropriate caution. There always are settings (i. Moreover.)On the other hand. for most realizations.2. (See Tierney 1994 and Raftery and Lewis 1996 for other criticisms. whatever its theoretical validation. a single chain may present probabilistic pathologies which are more often avoided by parallel chains. transition kernels) which. The criticisms presented in the wake of the techniques proposed below only highlight the incomplete aspect of each method and therefore do not aim at preventing their utilization but rather to warn against a selective interpretation of their results. and the randomness inherent to the nature of the problem prevents any categorical guarantee of performance. Far from being a failure acknowledgment. it is simply inconceivable in the light of recent results to envision automated stopping rules. as in Gelfand and Smith (1990). single chain methods su er more severely from the defect that "it only sees where it went". or Robert 1997a for illustrations). this plot is only useful 2 To borrow from the injunction of Hastings (1970).2 Monitoring Convergence to the Stationary Distribution 8.8.1 Graphical Methods A natural empirical approach to convergence control is to draw pictures of the output of simulated chains. Brooks and Roberts (1995) also stress that the prevalence of a given control method strongly depends on the model and on the inferential problem under study. The crux of the di culty is actually similar to Statistics. the goal being nowadays to develop \convergence control spreadsheets". It is therefore even more crucial to develop robust and general evaluation methods which extend and complement the present battery of stopping criteria. \even the simplest of numerical methods may yield spurious results if insu cient care is taken in their use (. in the sense that the part of the support of f which has not been visited by the chain at time T is almost impossible to detect. in the sense of computer graphical outputs which could present through several graphs di erent features of the convergence properties of the chain under study (see Cowles and Carlin 1996. Cowles and Vines 1996. these remarks only aim at warning the reader2 about the relative value of the indicators developed below. in order to detect deviant or non-stationary behaviors. As noted by Cowles and Carlin (1994). a rst plot is to draw the sequence of the (t) 's against t..e. However. To conclude..2 ] MONITORING CONVERGENCE TO THE STATIONARY DISTRIBUTION333 additional practical drawback of parallel methods is that they require a modi cation of the original MCMC algorithm to deal with the processing of parallel outputs. where the uncertainty due to the observations prohibits categorical conclusions and nal statements.8.

(Hint: Show that (2 ( j t?1) ) (0) based on m parallel chains ( j ) (j = 1. if j . 2 ) ?2 G a(0:5. P 8. t . . ii 8.22 (Dellaportas 1994) Show that Z (x) IEg min 1. 8. 8. ij 1 Jt = m (0) (a) Show that. IE It ] = IE Jt ] = 1. (b) Show that var(It ) var(Jt )=(m ? 1) 8. Derive the Gibbs stopper of x8.20 (Raftery and Lewis 1992a) Deduce (8.2. . P 8.8. p 3=2 T p( + ) q (2 ? ? ) i i ! "0 + 1 : 2 ! 8. j t?1) ) .19 (Problem 8. (2 K ( i(0) . log 1 ?i = .23 (Tanner 1996) Show that.1 by proposing an estimate of t .21 (Raftery and Lewis 1996) Consider the logit model + i N (0. m). Apply the various convergence controls to this case.376 DIAGNOSING CONVERGENCE 8. IE It] IE Jt ]. 0:2): Study the convergence of the associated Gibbs sampler and the dependence on the starting value. if (t) t and if the stationary distribution is the posterior density associated with f (xj ) and ( ).4.25 Propose an estimator of the variance of ST (when it exists) and derive a convergence diagnostic.3) from the normal approximation of T .18 continued) De ne It = m(m1? 1) with t = ij X i6=j t. 8.5 m X i=1 t.1.6. nd a set of parameters ( .26 Check whether the importance sampling ST is (a) available and (b) with nite variance for the Examples of Chapter 7. f (x) = jf (x) ? g(x)jdx g Derive an estimator of the L1 distance between the stationary distribution and the distribution at time t.24 For the witch hat distribution of Example 8. and that for every initial (0) distribution on the j 's. the weight (t) (t) !t = f (xj t ( )(t)() ) converges to the marginal m(x). y) for which the mode at y takes many iterations to be detected.

Ritter and Tanner (1992) propose to use the weight wt through a stopping rule based on the evolution of the histogram of the wt 's. An additional di culty is related to the computation of K and of its normalizing constant. A di erent implementation can be based on ( parallel chains jt) .2) but Ritter and Tanner (1992) develop a general approach (called Gibbs stopper) where. which can induce a considerable increase of the computation time.6 Notes 8. show that a small set is available in the ( j ) space and derive the corresponding renewal probability. and apply the strong Markov property.3.27 For the model of Example 8.8.) 8. Cowles and Carlin (1994) note moreover .2 is a Markov chain. who rst approximated the distribution of (t) to derive a convergence assessment. 8. ) : f^t ( ) = m j=1 Brooks and Roberts (1995) cover the extension to Metropolis{Hastings algorithms (see also Zellner and Min 1995. for a similar approach). the approximation of Z ft ( ) = K ( 0. ) ft?1 ( 0 ) d 0 is m 1 X K ( (t?j). : : : . (t) wt = f ( (t)) : f^t ( ) Note that this evaluation aims more at controlling the convergence in distribution of ( (t) ) to f ( rst type of control) than at measuring the mixing speed of the chain (second type of control).4. ^ In some settings.1 Other Stopping Rules Following Tanner and Wong (1987).3. 0 ) denotes the transition kernel. if K ( .6 ] NOTES 377 8. until these weights are su ciently concentrated near a constant which is the normalizing constant of f . (Hint: Show that n P ( (n) = ij (? ?1) = j. The theoretical foundations of this approximation are limited since the (t?j) 's are not distributed from ft?1 . although it does not enjoy a stronger theoretical validity (see x8.8. However. the example treated in Tanner (1996) does not indicate how the quantitative assessment of concentration is operated. which evaluate the di erence between the limiting distribution f and an approximation of f ^ at iteration t.6. ( n?2 ?1) 2 A` .28 Show that the sequence ( (n)) de ned in x8. (n?2) = `. the approximation ft can be computed by Rao-Blackwellization as in (8. : : :) = IE (0) IIAi ( n?1 ?1+ n ) ( n?1 ?1) 2 Aj . Ritter and Tanner (1992) propose a control method based on the distribution of the weight !t .8. ft .4). since it used characteristics of the distribution f which are usually unknown to calibrate this convergence of the weights wt.

1) TA t=1 1 TA X h( (t) ). Geweke (1992) proposed to use the spectral density of h( (t) ). For instance. Moreover. 2 2 and the estimates A and B of Sh (0) based on both subsamples. B= (t) TB t=T ?T +1 h( ) B 1 T X < 1).9.1) since the limiting variance h of Proposition 4. Sh (w) = 21 t=1 X where i denotes the complex square root of 1. q) process.2 Spectral Analysis As already mentioned in Hastings (1970).6. that is einw = cos(nw) + i sin(nw): The spectral density relates to the asymptotic variance of (8. We can therefore derive a convergence diagnostic from (8.6.8 is given by 2 2 h = Sh (0) : Estimating Sh by non-parametric methods like the kernel method (see Bosq and Lecoutre 1988). we can model ( (t) ) as an ARMA(p. for an introduction). namely that they necessarily induces losses in e ciency in the processing of the problem (since they are based on a less constrained representation of the model).6 ^ that the criterion is sensitive to the choice of m in ft. 8. under an adequate parameterization. the di erence p T r ( A ? B) (8. The values suggested by Geweke (1992) are A = 0:1 and B = 0:5. h( (t) ) einw . A global criticism of this spectral approach also applies to all the methods using a non-parametric intermediary step to estimate a parameter of the model. the calibration of non-parametric estimation methods (as the choice of the window in the kernel method) is always delicate since it is not standardized. and then use partially empirical convergence control methods. respectively. Geweke (1992) takes the rst TA observations and the last TB observations from a sequence of length T to derive A= t=?1 ? cov h( (0) ). and a determination of the size t0 of the training sample.378 DIAGNOSING CONVERGENCE 8. in the sense that large values of m lead to histograms which are much stabler and therefore indicate a seemingly faster convergence. They propose to use the criterion di erently via the empirical variance of the weights !t on parallel chains till it converges to 0. Brooks and Roberts (1995) repeat these criticisms on the di cult interpretation of histograms and the in uence of the size of windows. estimating the parameters p and q. the chain ( (t)) or a transformed chain (h( (t) )) can be considered from a time series point of view (see Gourieroux and Monfort 1990 or Brockwell and Davis 1996. TB = B T and A + B 2 2 .1.8.6. We therefore refer to Geweke (1992) for a more detailed study A+ B A B is a standard normal variable (with TA = A T .1). Asymptotically (in T ).

They then approximate %t by value ` t=1 t=1 %t = m ^ 1 ( min ) .6.6 ] NOTES 379 of this method.6. there exist and 1 > j 2 j > j 3 j such that the quantity %t = IE z (t)] = P ( (t) ) = % + t + O(j 3 jt ). Moreover. . based on ] ] BT (S ) = S Ts^ ? Ts2 .b). 0 s 1 1= (T (0)) where T t X (t) 1 X (t) .6. 8.1). They show that. For large T . under some conditions. for instance). Cowles and Vines 1995.) In their study of the convergence of %t to %. the approach of Garren and Smith (1993) does not require a preliminary evaluation of ( . the estimators of and 2 are instable. . for a discussion). Their method thus provide the theoretical background to Yu and Mykland's (1993) CUSUM criterion (see x8. BT is approximately a Brownian bridge and can be tested as such. but it is quite costly in simulations. 2 with % the limiting value of %t . and Garren and Smith (1993) suggest to choose T such that ^ and ^ 2 remain stable. Other approaches based on spectral analysis are given in Heidelberger and Welsh (1988) and Schruben.3. Note that Heidelberger and Welch (1983) test stationarity via a Kolmogorov-Smirnov test. Singh and Tierney (1983) to test the stationarity of the sequence by Kolmogorov{Smirnov tests (see Cowles and Carlin 1994 and Brooks and Roberts 1995. 2 m X and derive some estimations of %.8.4 The CODA software . with the same initial (0) (1 ` m). and 2 from T X t=n0 +1 II (t) < `=1 ` (^t ? % + % t )2 . ). which is used in some softwares (see Best.2 and Brooks and Roberts 1995. 2 where n0 and T need to be calibrated. When compared with the original binary control method. the expansion of %t around % is only valid under conditions which cannot be veri ed in practice. see x7. Garren and ( Smith (1993) propose to use m parallel chains ( `t) ). (These conditions are related with the eigenvalues of the functional operator associated with the transition kernel of the original chain ( (t) ) and with Hilbert-Schmidt conditions.3 Further discretizations Garren and Smith (1993) use the same discretization z (t) as Raftery and Lewis (1992a. When T is too high.8. =T St = and ^(0) is an estimate of the spectral density. 8.

the method proposed by Propp and Wilson (1996) is called coupling from the past (CFTP). so far. the stationarity of the nite chain obviously transfers to the dual chain. In other cases. While originally intended as an output processor for the BUGS software (see x7. in order to evaluate the necessary computing time or the mixing properties of the chain.3).1. . The appeal of these methods for mainstream statistical problems is yet unclear. In a nite state space X of size k.3. several authors have proposed devices to sample directly from the stationary distribution f . Moreover. Geweke (1992) (x8. do allow for perfect sampling.1). The techniques selected by Best et al. Following Propp and Wilson (1996). it is essential. The MCMC output must however be presented in a very speci c S-plus format to be processed by CODA.15). (0) f (x). to the greater simplicity of these spaces and.e. this is due.4). so that (0) .6 While the methods presented in this chapter are at various stages of their development. in the sense that MCMC methods have been precisely introduced to overcome the di culties of simulating directly from a given distribution.8. algorithms such that (0) f . Raftery and Lewis (1992a) (x8. at varying computational costs11 and for speci c distributions and/or transitions. Note also that in settings when the Duality Principle of x7. for another. even if the latter is continuous. the computation time required to produce (0) exceeds by orders of magnitude the computation time of a (t) from the transition kernel. : : : are nearly independent. this software can also be used to analyze the output of Gibbs sampling and Metropolis{Hastings algorithms.2).6. plus plots of autocorrelation for each variable and of cross-correlations between variables. to start the Markov chain in its stationary regime.6. replacing the exact sampling terminology of Propp and Wilson (1996) with a more triumphing quali cation! The main bulk of the work on perfect sampling deals. in some settings.380 DIAGNOSING CONVERGENCE 8. Heidelberger and Welch (1983) (x8. because the bias caused by the initial value/distribution may be far from negligible.5 Perfect simulation Although the following imperative is rather in opposition with the theme of the previous chapters. with nite state spaces.6. 8.3). from both points of view of (a) speeding up convergence and (b) controlling convergence.1. the convergence issues are reduced to the determination of an acceptable batch size k. but Murdoch and Green (1997) have shown that some standard examples in continuous settings. if it becomes feasible to start from the stationary distribution. (k) . one needs to know. some of the most common techniques have been aggregated in an S-Plus software called CODA. to statistical physics motivations related to the Ising model (see Example 7. Cowles and Vines (1996). that is the convergence diagnostics of Gelman and Rubin (1992) (x8. developed by Best.1). (2k) . like the nuclear pump failure model of Example 7.2 applies. It runs in parallel k chains corresponding to all possible starting points in X farther and farther back in time till all chains take the same value (or coalesce) at time 0 (or earlier). (1996) are mainly those described in Cowles and Carlin (1996). \how long is long enough". 11 In most cases. as put by Fill (1996).4. i. for one thing.18 (see Problem 8. and to the accuracy of an ergodic average. The denomination of perfect sampling for such techniques was coined by Kendall (1996).6.

This algorithm nonetheless requires a high degree of analyticity to compute the expectation (E) step and therefore cannot be used in all settings.1. quality control.1) seem to call naturally for simulation. As mentioned in x5.3. 1998 Missing data models (introduced in x5. (1977) (see x5. Qian and Titterington 1991.2. epidemiological studies. medical experiments. Tanner (1991) or MacLachlan and Krishnan (1996) for deeper perspectives in this domain. See Everitt (1984). It is only with the EM algorithm that Dempster et al.5.2 February 27.) produce a grouping of the original observations in less informative categories. Wei and Tanner 1990.3. Little and Rubin (1987). 9.1 Discrete Data Models Numerous settings (surveys. this intuition has taken a while to formalize correctly. in order for it to replace the missing data part so that one can proceed with a \classical" inference on the complete model.1 Introduction 9.3. and to go further than mere ad hoc solutions with no true theoretical justi cation.2 First examples 9.CHAPTER 9 Assimilation and Application: Missing Data Models Version 1. It must rather be understood as a sequence of examples on a common theme. design of experiment. stochastic versions of EM (Broniatowski. often for reasons beyond . Celeux and Diebolt 1983. However. This chapter mainly aims at illustrating the potential of Markov Chain Monte Carlo algorithms in the Bayesian analysis of missing data models. Celeux and Diebolt 1985. etc.4 and x5. Lavielle and Moulines 1997) have come closer to simulation goals by replacing the E step with a simulated completion of missing data. without however preserving the whole range of EM convergence properties.3) came up with a rigorous and general formulation of statistical inference though completion of missing data. although it does not intend to provide the reader with an exhaustive treatment of these models.

384 ASSIMILATION AND APPLICATION:MISSING DATA MODELS 9. This model describes.xi +20] (yi ) (? 1 + 2 yi ) is useful for the completion of the model through Gibbs sampling.9. (In particular. each corresponding to a grouping of individual data. the approximation bias to the inferior pack in a study on smoking habits. i where is the cdf of the normal distribution N (0. 0) denotes the normal distribution N ( . k When several variables are studied simultaneously in a sample.2. . . otherwise.2. The two distributions above can then be completed by 2 ( .12).xi +1] (yi ) + I xi +1. If the gi 's are known. 1. ( 1 ? 2 yi )).2.1 {Rounding e ect{ Heitjan and Rubin (1991) consider some random variables yi E xp( ) grouped in observations xi (1 i n) according to the procedure y gi jyi B(1. they examine whether the grouping procedure has an e ect on the likelihood and thus must be taken into account. Otherwise. 2 ) / e? yi I xi .xi +1] (yi ) ( 1 ? 2 yi ) +II xi . 1. the result is a contingency table. yi = 20i]y =20] if gi = 1.xi+20] (yi ) (? 1 + 2 yi ) = e? yi I xi . 2) truncated to IR+ (see Example 2. If the context is su ciently informative to allow for a modeling of the individual data. 2) n exp ( X ) n ? i=1 yi I >max( yi +ti ) ( . 1 .2 the control of the experimenter.1) or by introducing an additional arti cial variable ti such that ti jyi N+ ( 1 ( 2 yi . under the assumption that this bias increases with the daily consumption of cigarettes yi]. the completion of the model is straightforward.x +20] (yi ) epti =2 . 2) 1 2 to provide a Gibbs sampling algorithm. 1. 2. yi jti e i i i i 2 where N+ ( . Heitjan and Rubin (1991) (see also Rubin 1987) call the resulting process Data coarsening and study the e ect of data aggregation on the inference. This distribution can be simulated directly by an accept-reject algorithm (which requires a good approximation of |see Example 2. 1) and a] denotes the integral part of a. 0). the completion of the contingency table (by reconstruction of the individual data) may facilitate inference about the phenomenon under study. xi jgi.x +1] (yi ) + I x +1. the conditional distribution (yi jxi. ? ) ? ? yi I x . for instance.) Example 9.

2. the steps of the Gibbs sampler are 2 from the inverted gamma distribution 0 1X IG @164. and if N2T ( . Table 9. 0 9 = ?1(yijk ? ) . 3. .9. : : :. in particular the relation between height and diameter of the branches where they sleep. n12 = 11. .46 {Contingency table completion{ 1. of lizards. nij ). the model can be completed by simulation of the individual values and this allows for the approximation of the posterior distribution associated with the prior ( . Fienberg (1977) proposes a classical solution based on a 2 test.j. Simulate 8 < 1X (9. 2. j = 1.2 ] FIRST EXAMPLES 385 Diameter (inches) 4:0 > 4:0 Height > 4:75 32 11 (feet) 4:75 86 35 Table 9. 1968). To test the independence between these factors. for instance a multidimensional normal distribution with mean = ( 1 . n21 = 86 and n22 = 35. j = 1.k 1 ?1 (yijk ? )A .2.2 { Lizard habitat { Schoener (1968) studies the habitat Algorithm A. A:46] 4. log(4)). However. ?1. k = 1. 2. Observation of two characteristics of the habitat of 164 lizards (Source: Schoener. if n11 = 32.1) (1 ? 2 )?164=2 exp :? 2 (yijk ? )t i. In fact. k = 1. nij ). A possible alternative is based to assume a parametric distribution on the individual observations yijk of diameter and of height (i. 2) and covariance matrix = 2 1 1 = 2 0: The likelihood associated with Table 9.k according to .1 is then implicit.1 provides the information available on these two parameters.2. Simulate yijk N2T ( .1] Example 9. 2. )= 1 I ( ). Qij ) represents the normal distribution restricted to one of the four quadrants Qij induced by (log(4:75).1. =164).2. Qij ) (i. 2 (yijk ? )t i. : : :. Simulate through an Markov Chain Monte Carlo algorithm.j. .9. . Simulate N2 (y. even in the case = 0.2.

: : :.2. for instance. Simulate 2.2) pi = (xt ) . See for instance Gourieroux (1989) for an exhaustive processing of probit and logit models.j. 2 IRp : i Even though the model (9. where some binary variables yi . and the normal distribution truncated on the right in u. ) on .2. (i = 1. but the distribution (9. a survey with multiple questions may include non-answers to some personal questions.1. 1. are modeled through a Bernoulli distribution (9. that is n 1 if yi > 0.1) requires a Metropolis{Hastings step based.2.k yijk =164. For instance. 0) i if if yi = 1. x1. the algorithm which approximates the posterior distribution ( jy1 . on an inverse Wishart distribution k Another setup where grouped data appears in a natural fashion is made of qualitative models.2 Algorithm A. n) A:47] where N+ ( . The analysis of such structures is complicated by the fact that the failure to observe is not always explained. 1g and associated with a vector xi 2 IRp of covariates. yi = 0 otherwise. taking values in f0.2) is generally de ned in this form. 9. If these missing observations are entirely due to chance. 1. respectively. 2. See Albert and Chib (1993b) for an application of this model to longitudinal data for medical experiments and some details on the implementation of A:47].2. 2 . Given a conjugate distribution Np ( 0 . a completed model can be introduced where the completed data yi is grouped according to their sign. yi = 0.15 and 9. The logit model being treated in Problems 7.386 ASSIMILATION AND APPLICATION:MISSING DATA MODELS where y = i. and where X is the matrix whose columns are made of the xi 's.2 Data missing at random Numerous settings may lead to sets of incomplete observations. : : :. ( ?1 + XX t )?1 ): i N+ (xt . u) denote the normal distribution truncated on the left in u. etc. yn . it follows that the incompletely observed data only play a . a pharmaceutical experiment on the aftere ects of a toxic product may skip some doses for a given patient. we consider instead the probit model. Simulate X Np (( ?1 + XX t )?1 ( ?1 0 + yi xi ).47 {Probit posterior distribution{ yi 1. xn) is then P 9. : : :. u) and N? ( .9. The normal truncated distribution can be simulated by the algorithm of Geweke (1991) or Robert (1995). a calibration experiment may lack observations for some values of the calibration parameters. 0) i N? (xt .

The model gets identi able through constraints like 1 = 1 = 1 = 0. i=1 exp ?ra. 1987.m ( 0 + a + s + m ) (9.m. the likelihood of the complete model is much more explicit since it is na.2. sex and marital status.m . However.s.3) where za. ra.2 ] FIRST EXAMPLES 387 Age < 30 > 30 Men Single 20:0 24=1 30:0 15=5 Maried 21:0 5=11 36:0 2=8 Women Single Maried 16:0 16:0 11=1 2=2 18:0 ? 8=4 0=4 Table 9. while na.m and ya.i za.2.s.2.s.s.m 0 a s m 1 + expfw0 + w1ya.m. Example 9.2 =1 2 .3 {Non-ignorable non-response{ Table 9.m denote the number of persons covered by the survey. s (s = 1.s.s. sex and family status.s. 2. The observations are grouped by average.2 m=1.s. respectively.m and where a (a = 1.s.i g a .i )g ( + + + )ra.i denotes the probability of non-response. 2 .m. and we assume an exponential shape for the individual data. the number of responses and the average of these responses by category.m = 0 + a + s + m .9.m.s.m. 2) and m (m = 1. respectively. these distributions are not always explicit and a natural approach leading to a Gibbs sampler algorithm is to replace the missing data by simulation.m.m./male) and family (single/married) e ects.) role through their marginal distribution.2.9. where pa.s.m.i g. a direct analysis of the likelihood is not feasible analytically.i = expfw0 + w1ya.m y a.i (w0 + w1ya.i represents the indicator of missing observation. 2). where 1 i na.i g 1 + expfw0 + w1ya. ya. If the prior distribution on the parameter set is the Lebesgue measure on IR2 for (w0.s.s.m.m.s. 2) correspond to the age (junior/senior). 2. s=1. sex (fem.s.m) with a. On the contrary.i E xp( a. say in the shape of a logit model.2 describes the ( ctious) results of a survey on the income depending on age. Average incomes and numbers of responses/non-responses to a survey on the income by age. w1) and on IR+ for 0. pa.s.m expfz Y Y a. (Source: Little and Rubin.2.s.s. An important di culty appears when the lack of answer depends on the income.s.s.

).20 (Billio. . y2t ). y2t ) is Gaussian. 9. the above steps can be implemented without approximation.19 (Billio. Monfort and Robert 1998) A dynamic desequilibrium model is dened as the observation of where the yit are distributed from a parametric joint model. 1) and the epsilont 's are Np (0. b Show that a possible completion of the model is to rst draw the regime (1 versus 2) and then draw the missing component. y2t ). ). T ) yt = ( + (t?1)2 )1=2 t . c Show that when f (y1t .416 ASSIMILATION AND APPLICATION:MISSING DATA MODELS 9. The latent variables yt are not observed. a Give the distribution of (y1t . a Propose a noninformative prior distribution on the parameter = ( . f (y1t . y2t ) conditional on yt . : : : . yt?1 . where the t 's are iid N (0. Monfort and Robert 1998) Consider the factor ARCH model de ned by (t = 1.a.6 9. yt = min(y1t . ) which leads to a proper posterior distribution.9. b Propose a completion step for the latent variables based on f (yt jyt. . yt = ayt + t .

P. Assoc. 669{679.H. O. A. D. Statist. J. Statist. J. Statist. U. and Sejnowski. Soc.M.E. Modi ed signed log likelihood ratio. Math. 29{35. Azencott. S. P. Amer. Computing 12. and Chib. Athreya. (1985) A learning algorithm for Boltzmann machines. Tech.E. Barbieri. Biometrika 78. (1979) Applied Probability and Queues. O. P. Tech. H. 343-365. U. (1993b) Bayesian analysis of binary and polychotomous response data. London. (1989) Asymptotic Techniques for Use in Statistics. 147{169. and Chib. Trans.H. Hinton. and Stegun. (1994) The Weighted Bootstrap. Dept. 1987{1988 697. .W.J. \La Sapienza". J. Wiley. Barndor -Nielsen. (1979) The computer generation of Poisson random variables. Poisson and binomial distributions. 1037{1044. Modelling and Computer Simulations 2. Dover. Lecture Notes in Statistics 98.. (1992) Stationarity detection in the initial transient problem.B. J. (1988) Computational methods using a Bayesian hierarchical generalized linear model. Uni. 28. T. 245. S. J. R. E. G. Albert. On a formula for the distribution of the maximum likelihood estimator. and Kors. Asmussen. T. Appl. D. S. T. (1996) A reversible jump MCMC sampler for Bayesian analysis of ARMA time series. New York.M. New York. Abramowitz. 493{501. Roma. report. ACM Trans. 130{157. Chapman and Hall. and Dieter.E. 557-563.H. (1989) Simulated Annealing and Boltzman Machines: a Stochastic Approach to Combinatorial Optimisation and Neural Computing. and Titterington. Albert. Barndor -Nielsen. D. Atkinson. (1983).B.References Aarts. and Ney. Seminaire Bourbaki 40ieme annee. and Bertail. Ahrens.R. Business Economic Statistics 1. Barbe. Amer. 88. Archer. 83. (1995) Parameter estimation for hidden Markov chains. beta. (1978) A new approach to the limit theory of recurrent Markov chains. (1974) Computer methods for sampling from gamma. Barndor -Nielsen. (1964) Handbook of Mathematical Functions. (1991). J.H. S.J. P. J. I. Asmussen. Ackley. Glynn. and Cox. M. Biometrika 70. G. Assoc. O. and Thorisson. Amer..J. Cognitive Science 9. (1988) Simulated annealing. of Glasgow. Springer{Verlag. J. and O'Hagan. Wiley.. New York. 223{246. J. of Stat. Albert. (1993a) Bayes inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. M. K. report. 1{15.

Purdue Uni. Smith (Eds. and Green. J. Y. and Mengersen. and Giron. of Statistics.E. Chapman and Hall. Besag. Tech. Spiegelhalter). C) 27. 303{328. J. Statist. report. and Hartigan. 5{124. Ann. J. Besag.. (1989) Improving stochastic relaxation for Gaussian random elds. Berger.H. Bernardo. W.E. DeGroot. and Mengersen. and Wolpert.J. M.J. (1977) Minimum Hellinger distance estimates for parametric models. (1994) Discussion of \Markov chains for exploring posterior distributions". Best. Besag. 67{78. Assoc. and Richard. 339{358. report #9610C.. (1985) Statistical Decision Theory and Bayesian Analysis (2nd edition). Wiley.418 REFERENCES 9. Beran. Ann. Berge. London. 260{279. P.O. 88.L. In Markov chain Monte-Carlo in Practice (Ed. and Smith. J. Statistical Science 10. Berger. Besag. K. New York. L. D. 395{ 407. J. (1996) Estimation of quadratuc functions: reference priors for non-centrality parameters. Journal of the Royal Statistical Society (Series B) 36. and Wake eld. Plann. J. 20.O. New York. (1988) The Likelihood Principle (2nd edition). and Vidal. Richardson and D. Springer-Verlag. J. Higdon. 192{ 326. Bernardo.F. Barcelona. A. (1993) A Bayesian analysis of change point problems. Spain. Applied Statistics (Ser. 19{46. Bernardo. Racine-Poon. Statist. Besag. C. J.F. Lindley and A. Econometrics 29. J. J. Philippe.6 Barone. Biometrics 47. (1992) Product partition models for change point problems. 3{66. R.. 25{38. Wiley. (1996) MCMC for nonlinear hierarchical models. J. R. (1995) Bayesian computation and stochastic systems (with discussion). J. F.C. D. 309{319. Inference 25. In Second Catalan International Symposium on Statistics. (1986) A Bayesian approach to cluster analysis. E. and Hartigan. (1988) A Bayesian analysis of simple mixture problems. A.. Statist. D. 1734-1741. 181.O. C. Ann. Applied Statistics 16.M.M. P. (1984) Order Within Chaos. Bauwens.J. In Bayesian Statistics 3. Hayward. Tech. (1994) An overview of of robust Bayesian analysis (with discussion).T. Pommeau. 5. Dept. (1993) Meta-Analysis via Markov Chain MonteCarlo methods.M.M.M.L.J. J. Oxford University Press. J. Besag. 1473{1487. D. Statist. Barry. Oxford. Gilks. IMS Lecture Notes | Monograph Series 9. S. Colorado State Univ. J.O.V. (1994) Bayesian Theory. J.). of Statistics. Bernardo. Bennett. Journal of the Royal Statistical Society (Series B) 55. New York. Amer. P. Statist. J.F. (1990) Robust Bayesian analysis: sensitivity to the prior. J. Berger. F. California. J. D. (1984) Bayesian Full Information of Simultaneous Equations Models Using Integration by Monte Carlo. (1978) Letter to the editor. L. J. and Frigessi. and Giron. A. Berger. J. (1974) Spatial interaction and the statistical analysis of lattice systems (with discussion). 22. Springer-Verlag. Barry. . J. TEST 3.O.P. (1992) Spatial Statistics and Bayesian computation (with discussion).9. (1985) A 1-1 Poly-t random variable generator with application to Monte Carlo integration. Berger. J. A.R. J.J.. J. 445{463.M. J. Dept. K. Lecture Notes in Economics and Mathematical Systems 232. and Robert. (1989) Towards Bayesian image analysis. Green. Bauwens. New York.