
Random Number Generation

Uniform random numbers


Rules of the form x_{n+1} = a + b·x_n (mod m) are called linear congruential generators,
or LCGs. The resulting sequence provides pseudo-uniform integers on 0, . . . , m − 1.
An initial value x_0, called the seed, must be specified. To get pseudo-uniform draws on
(0, 1), use U_n = x_n/m.
The values of a, b, and m must be chosen very carefully in order to give a sequence
that behaves as an iid uniform sequence should behave.
The period of an LCG is the number of distinct values that occur in the sequence of
x_i's. Clearly this is at most m, but it can be smaller depending on the values of a and b.
A special case of the LCG is the multiplicative congruential generator (MCG), given
by x_{n+1} = b·x_n (mod m), or equivalently x_n = x_0·b^n (mod m).
For an MCG, m should be a prime number and b should be a primitive root mod m,
meaning that the powers of b generate all values between 1 and m − 1.
Good choices for the parameters are b = 16807 and m = 2^31 − 1 = 2147483647.
Application of the MCG requires forming the product b·x_n, which will overflow a 32-
bit register. The following trick (known as Schrage's algorithm) works around this
problem. Write m = bq + r, where r < b. Since b < √m, r < q as well. For z ≤ m − 1,
let z′ = z (mod q), and let z″ = ⌊z/q⌋. Then b·z′ and r·z″ are both smaller than m, and
b·z (mod m) is equal to either b·z′ − r·z″ or b·z′ − r·z″ + m (whichever is nonnegative).
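A minimal Octave sketch of Schrage's algorithm for the multiplicative generator above
(b = 16807, m = 2^31 − 1); the intermediate products never exceed m, so the update
is exact in 32-bit (or double) arithmetic.

## Compute b*z mod m without overflow, via the decomposition m = b*q + r.
function x = schrage(z)
  b = 16807;
  m = 2147483647;           ## 2^31 - 1
  q = floor(m / b);         ## q = 127773
  r = mod(m, b);            ## r = 2836; note r < q
  x = b * mod(z, q) - r * floor(z / q);
  if (x < 0)
    x = x + m;              ## fold back into 0..m-1
  endif
endfunction

Starting from a seed such as z = 1, repeated calls x = schrage(x) reproduce the MCG
sequence.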
When a ≠ 0, the modulus m may be non-prime. The standard C library function
drand48() is an LCG with m = 2^48, a = 11, and b = 25214903917. This random
number generator can be shown to have period 2^48. The following shows a sequence
from this LCG (the results differ from the results of drand48() due to the use of the
shuffle box described below).
x_i                      a + b·x_i                    x_{i+1}            x_{i+1}/2^48
0 11 11 0.000000
11 277363943098 277363943098 0.000985
277363943098 6993705175256325314877 11718085204285 0.041631
11718085204285 295470392517265591684356 49720483695876 0.176643
49720483695876 1253697219098278389146303 102626409374399 0.364602
102626409374399 2587715051722178864620894 25707281917278 0.091331
25707281917278 648206643511396312177937 25979478236433 0.092298
25979478236433 655070047545450703808072 137139456763464 0.487217
137139456763464 3457958225520320556088499 148267022728371 0.526750
148267022728371 3738538732155529954929218 127911637363266 0.454433
127911637363266 3225279645980899415312933 65633894156837 0.233178
65633894156837 1654952334863192683630540 233987836661708 0.831292
233987836661708 5899980819171657253110247 262259097190887 0.931731
262259097190887 6612837937027380313004390 159894566279526 0.568060
159894566279526 4031726125588636254303353 156526639281273 0.556094
156526639281273 3946804169928216632446352 14307911880080 0.050832
14307911880080 360772623309120026273371 215905707320923 0.767051
215905707320923 5444041665228996928755402 5324043867850 0.018915
5324043867850 134245254577730795368461 71032958119949 0.252360
71032958119949 1791089213934798995940244 83935042429844 0.298197
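The update x ← (a + b·x) mod 2^48 cannot be computed naively in double precision,
since the product b·x can approach 2^93. A minimal Octave sketch that reproduces
the table above by splitting b and x into 24-bit halves, so that every intermediate
product stays exact:

## One drand48-style update, x <- (a + b*x) mod 2^48, exact in doubles.
function x = lcg48(x)
  a = 11;
  b = 25214903917;
  b1 = floor(b / 2^24);  b0 = mod(b, 2^24);  ## high/low 24-bit halves of b
  x1 = floor(x / 2^24);  x0 = mod(x, 2^24);  ## high/low 24-bit halves of x
  hi = mod(b1*x0 + b0*x1, 2^24);             ## bits 24..47 of b*x
  x = mod(hi * 2^24 + b0*x0 + a, 2^48);      ## b1*x1*2^48 vanishes mod 2^48
endfunction

Starting from x = 0, successive calls give 11, 277363943098, . . . , matching the table.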
A rule of the form x_{n+1} = a + b·x_n^{−1} (mod m), where z^{−1} denotes the inverse
of z mod m (i.e. m divides z·z^{−1} − 1), is called an inversive congruential generator,
or ICG. ICGs have slightly better statistical properties than LCGs, but the computation
of x_n^{−1} is expensive.
All LCGs exhibit some positive autocorrelation. In particular, with the MCG, an
extremely small value is always followed by another small value, and this is true of the
LCG as well if a is small relative to m. For example, in the MCG given above, the
frequency of values less than 10000 is 4.7 × 10^{−6}, but such a value is always followed
by a value less than .07·m.
The output of LCGs falls mainly in the planes. This means that if consecutive
values are binned into k-tuples z_j = (x_{kj}, x_{kj+1}, . . . , x_{kj+k−1}), then the z_j tend to fall
into hyperplanes. There are about m^{1/k} distinct hyperplanes for most LCGs.
A shuffle box can break up low-order serial correlations and destroy the hyperplane
structure. To initialize, fill an array [v_1, . . . , v_32] with random numbers, and let k
denote another random number. To generate one draw, let k′ denote the high-order
five bits of k. The output random number is v_{k′}. Replace k with v_{k′}, and replace v_{k′}
with a new random number.
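A minimal Octave sketch of the shuffle box, assuming the raw generator produces
48-bit integers (here a stand-in based on rand(); any LCG output works):

raw = @() floor(rand() * 2^48);     ## stand-in for the underlying LCG
v = arrayfun(@(j) raw(), 1:32);     ## fill the shuffle table
k = raw();                          ## auxiliary state
out = zeros(1, 10);
for i = 1:10
  kp = floor(k / 2^43) + 1;         ## high-order five bits of k, in 1..32
  out(i) = v(kp);                   ## output the selected table entry
  k = v(kp);                        ## replace k with the output value
  v(kp) = raw();                    ## refill the vacated slot
endfor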


Elementary simulation of non-uniform random numbers
CDF transformation:
Suppose F(t) is a CDF and U is a uniform deviate. Then

t = P(U < t)
F(t) = P(U < F(t))
F(t) = P(F^{−1}(U) < t),

so F^{−1}(U) has CDF F.
The CDF of an exponential distribution with unit mean is F(t) = 1 − exp(−t). We
can solve for the inverse:

F^{−1}(U) = −log(1 − U).

Therefore the distribution of −log(U) is exponential with mean 1 (since 1 − U is also
uniform).
The logistic distribution has CDF F(x) = e^x/(1 + e^x), which has inverse F^{−1}(x) =
log(x/(1 − x)). Thus log(U/(1 − U)) has a logistic distribution.
The Cauchy distribution has density π^{−1}/(1 + t²), and CDF 1/2 + atan(t)/π. Thus
tan(π(U − 1/2)) has a Cauchy distribution.
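A minimal Octave sketch of the three inversions above:

U = rand(10000, 1);          ## uniform deviates
E = -log(U);                 ## exponential with mean 1
L = log(U ./ (1 - U));       ## standard logistic
C = tan(pi * (U - 0.5));     ## standard Cauchy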
The inversion method can also be applied to discrete random variables. Suppose G
has a geometric distribution, so the mass function is P(G = g) = (1 − p)^{g−1}·p and the
CDF is F(g) = 1 − (1 − p)^g for g = 1, 2, . . .. Since

P(G = g) = P(G ≤ g) − P(G ≤ g − 1)
         = P(F(g − 1) < U ≤ F(g))
         = P(g − 1 < log(1 − U)/log(1 − p) ≤ g),

a geometric random variable can be simulated using ⌈log(U)/log(1 − p)⌉.
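In Octave this is a one-liner:

p = 0.25;
G = ceil(log(rand(1000, 1)) / log(1 - p));   ## geometric draws on 1, 2, ...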
Density transformation: Suppose U = [U_1, . . . , U_k] is a k-vector of independent uniform
draws, and let G = [g_1(U), . . . , g_k(U)] denote a transformation of U. The density of G
is given by P(G) = |J(G)|, the Jacobian determinant of the inverse transformation. For
example, take k = 2, and let

g_1(x_1, x_2) = √(−2 log(x_1)) · cos(2πx_2)
g_2(x_1, x_2) = √(−2 log(x_1)) · sin(2πx_2).

It is easy to verify that |J(G)| = (2π)^{−1} exp(−G_1²/2) exp(−G_2²/2). This is called the Box-
Muller method for generating Gaussian draws.
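A minimal Octave sketch of Box-Muller:

U1 = rand(5000, 1);
U2 = rand(5000, 1);
R = sqrt(-2 * log(U1));      ## radial part
Z1 = R .* cos(2*pi*U2);      ## standard normal
Z2 = R .* sin(2*pi*U2);      ## standard normal, independent of Z1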
Many random variables are expressible as functions of other random variables, pro-
viding a way to simulate them. Examples include the Cauchy distribution (Z = X/Y,
where X and Y are independent and normal), and the χ²_p distribution (Z = Σ_{i=1}^p X_i²,
where the X_i are independent and normal; note that this is not an efficient way to
simulate a χ² variate).
A Bernoulli trial with success probability p can be obtained by rounding: B = 1(U < p).
Binomial draws can be obtained by summing independent Bernoulli draws (although
there are much better ways to simulate a binomial draw).
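A minimal Octave sketch:

p = 0.3;  n = 20;
B = (rand() < p);              ## one Bernoulli(p) draw
X = sum(rand(1, n) < p);       ## one binomial(n, p) draw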
Rejection method
Suppose that

1. We want to simulate from the density π, which we can evaluate as a function.

2. There is a density f whose sample space contains the sample space of π, and we
can easily simulate from f.

3. There is a constant c such that sup_x π(x)/f(x) < c. Note that we must have c ≥ 1
since both π and f are densities.

Under these circumstances, we can generate a candidate draw from f (which is called
the trial distribution), and make a random decision as to whether we will accept or
reject the draw. If we specify the probability of accepting the draw in a certain way,
then the marginal distribution of the draws that are accepted will be π.
To carry out rejection sampling, generate Z according to f, and with probability
π(Z)/cf(Z) use Z as the next draw. Otherwise, reject it and draw a new Z. The
resulting distribution has density π:

P(Z|accept) = P(accept|Z)·P(Z)/P(accept)
            = [π(Z)/cf(Z)] · f(Z)/P(accept)
            = π(Z)/(c·P(accept)).

Since P(Z|accept) and π(Z) are both densities in Z, it follows that P(accept) = 1/c,
and hence P(Z|accept) = π(Z).
For example, π(x) = exp(−x²/2)/√(2π) is the standard normal density and f(x) =
π^{−1}/(1 + x²) is the Cauchy density. It can be shown that

π(x)/f(x) ≤ √(2π/e).
Therefore if we simulate a Cauchy draw Z (e.g. using tan(π(U − 1/2)) where U is
uniform), and then accept it with probability

[exp(−Z²/2)/√(2π)] / [√(2π/e) · π^{−1}/(1 + Z²)] = exp(−Z²/2)(1 + Z²)·√e/2,

then the draws that are accepted will be iid standard normal.
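A minimal Octave sketch of this sampler; the overall acceptance rate should be near
1/c = √(e/2π) ≈ 0.66:

X = zeros(1000, 1);
for i = 1:1000
  while (true)
    Z = tan(pi * (rand() - 0.5));     ## Cauchy trial draw
    ## Acceptance probability pi(Z)/(c*f(Z)) from the display above.
    if (rand() < exp(-Z^2/2) * (1 + Z^2) * sqrt(e) / 2)
      X(i) = Z;
      break;
    endif
  endwhile
endfor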
The efficiency of rejection sampling is determined by P(accept) = 1/c; larger bounds c
yield less efficient schemes. In the limit, if π(x)/f(x) is unbounded, rejection sampling
cannot be used. For example, if π(x) is Cauchy and f(x) is Gaussian, rejection
sampling cannot be used.
Suppose we cannot find a tight upper bound for π(Z)/f(Z), but we know π(Z)/f(Z) ≤
c′. Suppose that c is the tight upper bound. Since P(accept) = 1/c, the expected
number of trials that must be made to get one accepted value is c. Thus the rejection
sampling scheme using c′ is c′/c times worse than the optimal rejection sampling scheme
in terms of average performance.
An important class of examples arises when simulating uniform distributions on a
complicated set S. Suppose we embed S in a larger set T on which we can easily
simulate (say a ball or a cube). Let f be the uniform distribution on T, and note that
sup_x π(x)/f(x) = Vol(T)/Vol(S) ≡ c. If x ∈ S then π(x)/cf(x) is 1, otherwise it is
0. So we accept all draws that lie in S and reject all draws in T \ S. The marginal
acceptance rate is Vol(S)/Vol(T), so the method is more efficient when T is as small
as possible.
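A minimal Octave sketch, taking S to be the unit disk and T = [−1, 1]²; the acceptance
rate is Vol(S)/Vol(T) = π/4:

P = zeros(1000, 2);
i = 0;
while (i < 1000)
  X = 2 * rand(1, 2) - 1;      ## uniform draw on the square T
  if (sum(X.^2) <= 1)          ## accept only points inside the disk S
    i = i + 1;
    P(i,:) = X;
  endif
endwhile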
Example: The triangular distribution has density p(x) = 2(1 − x)·1(0 ≤ x ≤ 1). We
can simulate from this distribution by rejection sampling from a uniform [0, 1] trial
distribution. A uniform draw U is accepted with probability 1 − U. The marginal
acceptance probability is P(accept) = 1/c = 1/2.
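In Octave, collecting draws one at a time:

X = zeros(1000, 1);
for i = 1:1000
  while (true)
    U = rand();
    if (rand() < 1 - U)        ## accept U with probability 1 - U
      X(i) = U;                ## a draw from the density 2*(1 - x)
      break;
    endif
  endwhile
endfor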
To rejection sample from π(x) using trial density f(x), we only need to know π(x)
and f(x) up to multiplicative constants. For example, we may be able to evaluate
π̃(x) = aπ(x) and f̃(x) = bf(x) but lack the ability to compute a and b. As long as we
can bound π̃(x)/f̃(x) < c̃, the rejection sampling scheme described above still works:

P(Z|accept) = P(accept|Z)·P(Z)/P(accept)
            = [π̃(Z)/c̃f̃(Z)] · f(Z)/P(accept)
            = π(Z).

However, we now have P(accept) = a/(b·c̃), so we do not know the marginal acceptance
rate.
We can rejection sample from a continuous trial distribution and achieve a discrete
density. Suppose p(k) (k = 1, 2, . . .) is a probability mass function. We can define a
density p̃(x) = p(⌈x⌉) on (0, ∞). If we sample x ~ p̃ via rejection sampling, then ⌈x⌉
has mass function p.
Suppose we want to sample from a binomial distribution B(p, n). If n is small then
we can simply add independent Bernoulli trials. If n is large, we can use rejection
sampling from a generalized Cauchy distribution with μ = np, σ = √(np(1 − p)), and
c = π/1.2.
Rejection sampling the gamma distribution
Simulating a Gamma distribution using rejection sampling (method 1): use rejection
sampling from a Cauchy trial distribution. Without loss of generality take β = 1, so
we are sampling from the density π(x; α) = x^{α−1} exp(−x)/Γ(α). We consider the case
α > 1. The ratio π(x)/f(x) can be written

π(x)/f(x) = [π/Γ(α)] · (exp((α − 1) log x − x) + exp((α + 1) log x − x))
          ≤ [2π/Γ(α)] · exp((α + 1) log x − x)
          ≤ [2π/Γ(α)] · exp((α + 1)(log(α + 1) − 1)).

Note that c goes to infinity like n^n/n!, so the efficiency of this method is poor for large
α (e.g., for α = 1, 1/c ≈ .294; α = 10, 1/c ≈ .012; α = 100, 1/c ≈ .0004).
Simulating a Gamma distribution using rejection sampling (method 2): use rejection
sampling from a generalized Cauchy trial distribution. This distribution has density

p(x) = (πσ)^{−1} / (1 + ((x − μ)/σ)²).

We now have the opportunity to optimize the acceptance rate over μ and σ. To
sample from the generalized Cauchy distribution, sample Z from a standard Cauchy
distribution, and transform via Z → σZ + μ.

It is proved in the following paper that the following values of μ, σ, and c are opti-
mal: J.H. Ahrens and U. Dieter. Generating gamma variates by a modified rejection
technique. Comm. ACM, 25(1):47–54, January 1982.

μ = α − 1
σ = √(2α − 1)
c = πσ · exp(−μ(1 − log μ))/Γ(α).
Simulating a Gamma distribution using rejection sampling (method 3): Let

f(x) = 4 exp(−α) α^{λ+α} x^{λ−1} / (Γ(α)·(α^λ + x^λ)²),

where λ = √(2α − 1). It is true that π(x; α) ≤ f(x). Therefore if we simulate from
a density that is proportional to f, then accept the draw with probability π(x; α)/f(x),
we have a Gamma draw.

If U is uniform on (0, 1), then α(U/(1 − U))^{1/λ} has density proportional to f.
## Generate draws from a gamma distribution with parameters alpha and
## beta=1.
alpha = 3;

## The lambda parameter for the trial distribution.
L = sqrt(2*alpha - 1);

## Count the total number of candidates.
n = 0;

## Storage for the accepted draws.
G = zeros(1000, 1);

## Generate 1000 draws.
for i=1:1000

  ## Loop until one draw is accepted.
  while (1)
    n = n+1;

    ## Generate the candidate point.
    u = rand(1,1);
    x = alpha*(u/(1-u))^(1/L);

    ## The log gamma density at x.
    g = -x + (alpha-1)*log(x) - lgamma(alpha);

    ## The log trial density, rescaled so that exp(g - f) <= 1.
    f = log(4) - alpha + (L+alpha)*log(alpha) + (L-1)*log(x) ...
        - lgamma(alpha) - 2*log(alpha^L + x^L);

    ## Accept x with probability exp(g - f).
    if (log(rand(1,1)) < g-f)
      G(i) = x;
      break;
    endif
  endwhile
endfor
Rejection sampling when sup_x π(x)/f(x) is difficult to determine or incorrectly
specified

Suppose we don't know a c such that π(x)/f(x) < c, but we know that a finite such c
exists. Begin with a guess c_0. Then when we reach a Z such that π(Z)/f(Z) = c > c_0,
we go back through all points that were previously accepted, and reject each
with probability 1 − c_0/c. Continue in this way, re-evaluating all points each time a
new upper bound for π/f is discovered. After many samples, the maximum ratio π/f
among the generated points will be close to sup π/f, so the points that were never
rejected will be very nearly a sample from π.
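A minimal Octave sketch of this adaptive scheme for the normal target with Cauchy
trials, starting from a deliberately small guess c_0 = 1 (the true bound is √(2π/e)):

c = 1.0;                                     ## initial guess c0
A = [];                                      ## accepted points so far
for i = 1:10000
  Z = tan(pi * (rand() - 0.5));              ## Cauchy trial draw
  r = sqrt(pi/2) * exp(-Z^2/2) * (1 + Z^2);  ## ratio pi(Z)/f(Z)
  if (r > c)
    ## A larger bound was found: retroactively thin the earlier
    ## acceptances, keeping each with probability c/r.
    A = A(rand(size(A)) < c / r);
    c = r;
  endif
  if (rand() < r / c)
    A(end+1) = Z;
  endif
endfor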
Suppose that sup_x π(x)/cf(x) is finite, but is not necessarily smaller than 1, so we
adopt the acceptance probability P(accept|Z) = min(π(Z)/cf(Z), 1). Then we have

P(Z|accept) = P(accept|Z)·P(Z)/P(accept)
            = min(π(Z)/cf(Z), 1) · f(Z)/P(accept)
            = min(π(Z), cf(Z))/(c·P(accept))
            ∝ min(π(Z), cf(Z)).

If π(Z) ≤ cf(Z) everywhere, we have the usual rejection sampling and hence the accepted
draws have distribution π. If the inequality does not hold, we get a distribution that is
shaped like π on the set {Z : π(Z)/f(Z) ≤ c} (although scaled incorrectly), and shaped
like f on its complement.

For example, suppose we wish to sample from a standard normal target density π,
using a Cauchy trial density f with c = 1.2, where the correct bound is √(2π/e) ≈ 1.52.
The accepted draws will be distributed according to the density on the right, below.
[Figure: left panel, the standard normal density and the bounding function 1.2 × Cauchy;
right panel, the density of the accepted draws.]
Importance sampling
One of the main applications of simulation is to estimate a population mean (an in-
tegral) using the sample mean. Given an arbitrary integration problem ∫h(x)π(x)dx,
where π > 0 is a weighting function, we can consider the integral as an expectation
with respect to a density f by writing

∫h(x)π(x)dx = ∫[h(x)π(x)/f(x)]·f(x)dx,
where it is necessary to select f so that hπ/f is integrable with respect to f. If we
have an iid sample Z_1, . . . , Z_n from f, then n^{−1}·Σ_i h(Z_i)π(Z_i)/f(Z_i) is a consistent
and unbiased estimate of ∫h(x)π(x)dx. This method for approximating an integral is
called importance sampling.
A practical advantage of importance sampling over rejection sampling is that there is
no need to calculate a bound for the ratio π/f.

Note that we did not reject any values produced by f; rather we weighted them with
importance weights w_i = π(Z_i)/f(Z_i), so that the estimate is the weighted sample
mean n^{−1}·Σ_i w_i·h(Z_i). The Z_i with small importance weights would likely be rejected
under rejection sampling, but under importance sampling we allow them to contribute
a small amount to the approximation.
As a simple example, consider ∫∫ exp(−|x| − |y|)dxdy over the region [−5, 5]². The
true value is 4(1 − exp(−5))² ≈ 3.95. The following two figures show the convergence
behavior of the importance sampling estimate based on a uniform trial density on
[−5, 5]², and on a bivariate standard normal trial density truncated to [−5, 5]². For
the uniform trial density, the importance weights are w_i = 100, and for the normal trial
density the importance weights are w_i = 2π(F(5) − F(−5))²·exp((z_1² + z_2²)/2), where
F is the standard normal CDF.
## Do 1000 replicates using a uniform [-5,5]^2 trial density.
for r=1:1000

  ## Simulate trial density values.
  Z = 10*rand(1000,2) - 5;

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral using weights equal to 100.
  I1(r) = 100*sum(F)/1000;
endfor

## Do 1000 replicates using a truncated standard normal trial density.
for r=1:1000

  ## Simulate from a truncated normal on [-5,5]^2.
  Z = [];
  while (1)
    X = randn(1000,2);
    ## Keep the rows where both coordinates lie in [-5,5].
    ii = find(max(abs(X), [], 2) <= 5);
    Z = [Z; X(ii,:)];
    if (size(Z,1) >= 1000)
      break;
    endif
  endwhile
  Z = Z(1:1000,:);

  ## Importance weights for the truncated normal trial density.
  W = 2*pi*(normal_cdf(5) - normal_cdf(-5))^2*exp((Z(:,1).^2+Z(:,2).^2)/2);

  ## The integrand values at the points in Z.
  F = e.^(-abs(Z(:,1)) - abs(Z(:,2)));

  ## Estimate the integral.
  I2(r) = dot(F, W) / 1000;
endfor
The efficiency of importance sampling depends on the skew of the weights. If the
weights are highly skewed, then the sample mean is mostly determined by just a few
values, so the usual √n convergence will not hold. From survey sampling, there is a
notion of effective sample size (ESS), given by the formula

ESS = SS / (1 + var(w)),

where SS is the actual sample size; the weighted sample mean should converge at rate
√ESS. Technically this result may not apply in importance sampling, since the weights
and the values being averaged are dependent, but it provides some heuristic guidance.
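A minimal Octave sketch of the diagnostic, assuming the weights are first normalized
to average 1 (e.g. the vector W from the truncated normal example above):

Wn = W / mean(W);                   ## weights scaled to have mean 1
ESS = numel(Wn) / (1 + var(Wn));    ## effective sample size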
A typical application is where π is a density, so we are approximating E_π h. In this
case E_f w = 1 and var_f(w) = E_π w − 1, so the effective sample size is SS/E_π w.
Example: Suppose we wish to compute the expectation E h(X), where X has a Γ(α, β)
distribution. One option is to use rejection sampling from one of the trial densities
given above to obtain an iid sample X_1, . . . , X_n from the Γ(α, β) distribution, then
estimate the expectation using the simple average n^{−1}·Σ_i h(X_i).
Another option is to simulate X_1, . . . , X_n from one of the trial densities given above
(call it f), then calculate the weights w_i = exp(−X_i/β)·X_i^{α−1} / (β^α·Γ(α)·f(X_i)). In
this case, the estimate of E h(X) is the weighted average n^{−1}·Σ_i w_i·h(X_i).
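A minimal Octave sketch of the weighted estimate, assuming for illustration a unit
exponential trial density f(x) = exp(−x) (a simple stand-in for the trial densities
above), β = 1, and h(x) = x²; the weights then reduce to w_i = X_i^{α−1}/Γ(α):

alpha = 3;
X = -log(rand(100000, 1));          ## draws from the exponential trial density
w = X.^(alpha-1) / gamma(alpha);    ## importance weights pi(X)/f(X)
est = mean(w .* X.^2);              ## estimates E X^2 = alpha*(alpha+1) = 12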
Suppose we generate Z_1, . . . , Z_n from a rejection sampling trial density f, and let D_i
be the indicator of whether Z_i is accepted (so each D_i is a Bernoulli trial with success
probability π(Z_i)/cf(Z_i)). The rejection sampling estimator of E_π Z can be written

θ̂ = Σ_i D_i·Z_i / Σ_i D_i.
View this as an estimator of E_π Z based on the data Z_1, . . . , Z_n. By the Rao-Blackwell
theorem, θ̃ = E(θ̂ | Z_1, . . . , Z_n) is unbiased and at least as efficient as θ̂. For large n,
θ̃ is approximately the importance sampling estimate of E_π Z, so importance sampling
can be viewed as the Rao-Blackwellization of rejection sampling.
If the densities are only known up to constants of proportionality, then the importance
sampling weights must be renormalized: w_i → w_i/Σ_j w_j. The resulting estimate is still
consistent, but it is no longer unbiased.
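A minimal Octave sketch of the self-normalized estimator, assuming an unnormalized
standard normal target π̃(x) = exp(−x²/2), a Cauchy trial density, and h(x) = x²
(so the true value is 1):

Z = tan(pi * (rand(100000, 1) - 0.5));   ## Cauchy trial draws
w = exp(-Z.^2/2) .* (1 + Z.^2);          ## weights proportional to pi(Z)/f(Z)
w = w / sum(w);                          ## renormalize
est = sum(w .* Z.^2);                    ## should be near E X^2 = 1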