Conditional expectation and least squares prediction
An important problem of probability theory is to predict the value of a future
observation Y given knowledge of a related observation X (or, more generally, given
several related observations X1, X2,…). Examples are to predict the future course of the
national economy or the path of a rocket, given its present state.

Prediction is often just one aspect of a “control” problem. For example, in guiding a
rocket, measurements of the rocket’s location, velocity, and so on are made almost
continuously; at each reading, the rocket’s future course is predicted, and a control is
then used to correct its future course. The same ideas are used to steer automatically
large tankers transporting crude oil, for which even slight gains in efficiency result in
large financial savings.

Given X, a predictor of Y is just a function H(X). The problem of “least squares
prediction” of Y given the observation X is to find that function H(X) that is closest
to Y in the sense that the mean square error of prediction, E{[Y − H(X)]²}, is minimized.
The solution is the conditional expectation H(X) = E(Y|X).

In applications a probability model is rarely known exactly and must be constructed
from a combination of theoretical analysis and experimental data. It may be quite
difficult to determine the optimal predictor, E(Y|X), particularly if instead of a single X a
large number of predictor variables X1, X2,… are involved. An alternative is to restrict the
class of functions H over which one searches to minimize the mean square error of
prediction, in the hope of finding an approximately optimal predictor that is much
easier to evaluate. The simplest possibility is to restrict consideration to linear
functions H(X) = a + bX. The coefficients a and b that minimize the restricted mean
square prediction error E{(Y − a − bX)²} give the best linear least squares predictor.
Treating this restricted mean square prediction error as a function of the two
coefficients (a, b) and minimizing it by methods of the calculus yield the optimal
coefficients: b̂ = E{[X − E(X)][Y − E(Y)]}/Var(X) and â = E(Y) − b̂E(X). The
numerator of the expression for b̂ is called the covariance of X and Y and is denoted
Cov(X, Y). Let Ŷ = â + b̂X denote the optimal linear predictor. The mean square error of
prediction is E{(Y − Ŷ)²} = Var(Y) − [Cov(X, Y)]²/Var(X).
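
To make the formulas concrete, the following sketch (Python with NumPy; the linear model and all parameter values are invented for illustration) estimates â and b̂ from simulated data and checks the mean-square-error identity:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)          # invented linear model plus noise

    # b = Cov(X, Y)/Var(X) and a = E(Y) - b*E(X), estimated from the sample
    b = np.cov(x, y, bias=True)[0, 1] / x.var()
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    # Check: E{(Y - Yhat)^2} = Var(Y) - Cov(X, Y)^2 / Var(X)
    mse = ((y - y_hat) ** 2).mean()
    identity = y.var() - np.cov(x, y, bias=True)[0, 1] ** 2 / x.var()
    print(mse, identity)                      # agree up to sampling error
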
If X and Y are independent, then Cov(X, Y) = 0, the optimal predictor is just E(Y), and
the mean square error of prediction is Var(Y). Hence, |Cov(X, Y)| is a measure of the
value X has in predicting Y. In the extreme case that [Cov(X, Y)]² = Var(X)Var(Y), Y is a
linear function of X, and the optimal linear predictor gives error-free prediction.

There is one important case in which the optimal mean square predictor actually is the
same as the optimal linear predictor. If X and Y are jointly normally distributed, the
conditional expectation of Y given X is just a linear function of X, and hence the optimal
predictor and the optimal linear predictor are the same. The form of the
bivariate normal distribution as well as expressions for the coefficients â and b̂ and for
the minimum mean square error of prediction were discovered by the English
eugenicist Sir Francis Galton in his studies of the transmission of inheritable
characteristics from one generation to the next. They form the foundation of the
statistical technique of linear regression.
The Poisson process and the Brownian motion process
The theory of stochastic processes attempts to build probability models for phenomena
that evolve over time. A primitive example appearing earlier in this article is the
problem of gambler’s ruin.
The Poisson process
An important stochastic process described implicitly in the discussion of the Poisson
approximation to the binomial distribution is the Poisson process. Modeling the
emission of radioactive particles by an infinitely large number of tosses of a coin having
infinitesimally small probability for heads on each toss led to the conclusion that the
number of particles N(t) emitted in the time interval [0, t] has the Poisson
distribution given in equation (13) with expectation μt. The primary concern of the
theory of stochastic processes is not this marginal distribution of N(t) at a particular
time but rather the evolution of N(t) over time. Two properties of the Poisson process
that make it attractive to deal with theoretically are: (i) The times between emission of
particles are independent and exponentially distributed with expected value 1/μ. (ii)
Given that N(t) = n, the times at which the n particles are emitted have the same joint
distribution as n points distributed independently and uniformly on the interval [0, t].

As a consequence of property (i), a picture of the function N(t) is very easily constructed.
Originally N(0) = 0. At an exponentially distributed time T1, the function N(t) jumps
from 0 to 1. It remains at 1 another exponentially distributed random time, T2, which is
independent of T1, and at time T1 + T2 it jumps from 1 to 2, and so on.
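
A short simulation built on property (i) — Python with NumPy assumed, and an arbitrary rate μ — constructs a path of N(t) from exponential interarrival times:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, t = 3.0, 10.0                     # rate and time horizon (arbitrary)

    # Property (i): interarrival times are independent Exponential(1/mu)
    gaps = rng.exponential(scale=1.0 / mu, size=1000)
    jump_times = np.cumsum(gaps)          # times T1, T1 + T2, ...
    n_t = np.searchsorted(jump_times, t)  # N(t): jumps that occur by time t
    print(n_t)                            # one draw from Poisson(mu * t) = Poisson(30)
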

Examples of other phenomena for which the Poisson process often serves as
a mathematical model are the number of customers arriving at a counter and requesting
service, the number of claims against an insurance company, or the number of
malfunctions in a computer system. The importance of the Poisson process consists in
(a) its simplicity as a test case for which the mathematical theory, and hence
the implications, are more easily understood than for more realistic models and (b) its
use as a building block in models of complex systems.

Brownian motion process
The most important stochastic process is the Brownian motion or Wiener process. It
was first discussed by Louis Bachelier (1900), who was interested in modeling
fluctuations in prices in financial markets, and by Albert Einstein (1905), who gave
a mathematical model for the irregular motion of colloidal particles first observed by the
Scottish botanist Robert Brown in 1827. The first mathematically rigorous treatment of
this model was given by Wiener (1923). Einstein’s results led to an early, dramatic
confirmation of the molecular theory of matter in the French physicist Jean Perrin’s
experiments to determine Avogadro’s number, for which Perrin was awarded a Nobel
Prize in 1926. Today somewhat different models for physical Brownian motion are
deemed more appropriate than Einstein’s, but the original mathematical model
continues to play a central role in the theory and application of stochastic processes.

Let B(t) denote the displacement (in one dimension for simplicity) of a colloidally
suspended particle, which is buffeted by the numerous much smaller molecules of the
medium in which it is suspended. This displacement will be obtained as a limit of
a random walk occurring in discrete time as the number of steps becomes infinitely
large and the size of each individual step infinitesimally small. Assume that at
times kδ, k = 1, 2,…, the colloidal particle is displaced a distance hXk, where X1, X2,… are
+1 or −1 according as the outcomes of tossing a fair coin are heads or tails. By time t the
particle has taken m steps, where m is the largest integer ≤ t/δ, and
its displacement from its original position is Bm(t) = h(X1 +⋯+ Xm). The expected value
of Bm(t) is 0, and its variance is h²m, or approximately h²t/δ. Now suppose that δ → 0,
and at the same time h → 0 in such a way that the variance of Bm(1) converges to some
positive constant, σ². This means that m becomes infinitely large, and h is approximately
σ(t/m)^(1/2). It follows from the central limit theorem (equation 12) that lim P{Bm(t) ≤ x}
= G(x/(σt^(1/2))), where G(x) is the standard normal cumulative distribution function defined
just below equation (12). The Brownian motion process B(t) can be defined to be the
limit in a certain technical sense of the Bm(t) as δ → 0 and h → 0 with h²/δ → σ².
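
The limiting behaviour can be checked numerically. In the sketch below (Python with NumPy; step counts and σ are arbitrary choices), the sum of m fair-coin steps is drawn as 2·Binomial(m, 1/2) − m, and the sample variance of Bm(t) is compared with σ²t:

    import numpy as np

    rng = np.random.default_rng(2)
    sigma, t = 1.0, 1.0
    m = 100_000                        # steps taken by time t, so delta = t/m
    h = sigma * np.sqrt(t / m)         # step size chosen so h**2/delta = sigma**2

    # The sum X1 + ... + Xm of fair-coin steps equals 2*Binomial(m, 1/2) - m
    heads = rng.binomial(m, 0.5, size=10_000)
    b_m_t = h * (2 * heads - m)        # B_m(t) for 10,000 independent walks

    # Central limit theorem: B_m(t) is approximately Normal(0, sigma**2 * t)
    print(b_m_t.mean(), b_m_t.var())   # near 0 and sigma**2 * t = 1
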

The process B(t) has many other properties, which in principle are all inherited from the
approximating random walk Bm(t). For example, if (s1, t1) and (s2, t2) are disjoint
intervals, the increments B(t1) − B(s1) and B(t2) − B(s2) are independent random
variables that are normally distributed with expectation 0 and variances equal to
σ2(t1 − s1) and σ2(t2 − s2), respectively.

Einstein took a different approach and derived various properties of the process B(t) by
showing that its probability density function, g(x, t), satisfies the diffusion equation
∂g/∂t = D∂²g/∂x², where D = σ²/2. The important implication of Einstein’s theory for
subsequent experimental research was that he identified the diffusion constant D in
terms of certain measurable properties of the particle (its radius) and of the medium (its
viscosity and temperature), which allowed one to make predictions and hence to
confirm or reject the hypothesized existence of the unseen molecules that were assumed
to be the cause of the irregular Brownian motion. Because of the beautiful blend of
mathematical and physical reasoning involved, a brief summary of the successor to
Einstein’s model is given below.

Unlike the Poisson process, it is impossible to “draw” a picture of the path of a particle
undergoing mathematical Brownian motion. Wiener (1923) showed that the
functions B(t) are continuous, as one expects, but nowhere differentiable. Thus, a
particle undergoing mathematical Brownian motion does not have a well-defined
velocity, and the curve y = B(t) does not have a well-defined tangent at any value of t. To
see why this might be so, recall that the derivative of B(t), if it exists, is the limit as h →
0 of the ratio [B(t + h) − B(t)]/h. Since B(t + h) − B(t) is normally distributed with mean
0 and standard deviation h^(1/2)σ, in very rough terms B(t + h) − B(t) can be expected to
equal some multiple (positive or negative) of h^(1/2). But the limit as h → 0 of h^(1/2)/h =
1/h^(1/2) is infinite. A related fact that illustrates the extreme irregularity of B(t) is that in
every interval of time, no matter how small, a particle undergoing mathematical
Brownian motion travels an infinite distance. Although these properties contradict the
commonsense idea of a function—and indeed it is quite difficult to write down explicitly
a single example of a continuous, nowhere-differentiable function—they turn out to be
typical of a large class of stochastic processes, called diffusion processes, of which
Brownian motion is the most prominent member. Especially notable contributions to
the mathematical theory of Brownian motion and diffusion processes were made
by Paul Lévy and William Feller during the years 1930–60.

A more sophisticated description of physical Brownian motion can be built on a simple
application of Newton’s second law: F = ma. Let V(t) denote the velocity of a colloidal
particle of mass m. It is assumed that

m dV(t) = −fV(t)dt + dA(t). (18)

The quantity f retarding the movement of the particle is due to friction caused by the
surrounding medium. The term dA(t) is the contribution of the very frequent collisions
of the particle with unseen molecules of the medium. It is assumed that f can be
determined by classical fluid mechanics, in which the molecules making up the
surrounding medium are so many and so small that the medium can be considered
smooth and homogeneous. Then by Stokes’s law, for a spherical particle in a gas, f =
6πaη, where a is the radius of the particle and η the coefficient of viscosity of the
medium. Hypotheses concerning A(t) are less specific, because the molecules making up
the surrounding medium cannot be observed directly. For example, it is assumed that,
for t ≠ s, the infinitesimal random increments dA(t) = A(t + dt) − A(t) and A(s + ds)
− A(s) caused by collisions of the particle with molecules of the surrounding medium are
independent random variables having distributions with mean 0 and unknown
variances σ² dt and σ² ds and that dA(t) is independent of dV(s) for s < t.

The differential equation (18) has the solution

V(t) = e^(−βt)V(0) + (1/m)∫₀ᵗe^(−β(t − u))dA(u), (19)

where β = f/m. From this equation and the assumed properties of A(t), it follows that
E[V²(t)] → σ²/(2mf) as t → ∞. Now assume that, in accordance with the principle of
equipartition of energy, the steady-state average kinetic energy of the particle,
m lim(t → ∞)E[V²(t)]/2, equals the average kinetic energy of the molecules of the
medium. According to the kinetic theory of an ideal gas, this is RT/2N, where R is the
ideal gas constant, T is the temperature of the gas in kelvins, and N is Avogadro’s
number, the number of molecules in one gram molecular weight of the gas. It follows
that the unknown value of σ² can be determined: σ² = 2RTf/N.

If one also assumes that the functions V(t) are continuous, which is certainly reasonable
from physical considerations, it follows by mathematical analysis that A(t) is a Brownian
motion process as defined above. This conclusion poses questions about the meaning of
the initial equation (18), because for mathematical Brownian motion the term dA(t)
does not exist in the usual sense of a derivative. Some additional mathematical analysis
shows that the stochastic differential equation (18) and its solution equation (19) have a
precise mathematical interpretation. The process V(t) is called the Ornstein-Uhlenbeck
process, after the physicists Leonard Salomon Ornstein and George Eugene Uhlenbeck.
The logical outgrowth of these attempts to differentiate and integrate with respect to a
Brownian motion process is the Ito (named for the Japanese mathematician Itō Kiyosi)
stochastic calculus, which plays an important role in the modern theory of stochastic
processes.
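
For illustration, a simple Euler discretization of the reconstructed equation (18) — a crude numerical scheme chosen here for brevity, not the Itō calculus itself, with invented parameter values — exhibits the steady-state variance σ²/(2mf):

    import numpy as np

    rng = np.random.default_rng(3)
    m, f, sigma = 1.0, 2.0, 1.0            # mass, friction, noise scale (invented)
    beta = f / m
    dt, n = 1e-3, 100_000

    # Crude discretization of m dV = -f V dt + dA with dA ~ Normal(0, sigma^2 dt)
    v = np.zeros(n)
    for k in range(1, n):
        dA = rng.normal(0.0, sigma * np.sqrt(dt))
        v[k] = v[k - 1] - beta * v[k - 1] * dt + dA / m

    # Discard the transient; E[V^2(t)] should approach sigma^2 / (2*m*f) = 0.25
    print(v[n // 2:].var(), sigma**2 / (2 * m * f))
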

The displacement at time t of the particle whose velocity is given by equation (19) is

X(t) − X(0) = V(0)/β + A(t)/f − V(t)/β.

For t large compared with 1/β, the first and third terms in this expression are small
compared with the second. Hence, X(t) − X(0) is approximately equal to A(t)/f, and the
mean square displacement, E{[X(t) − X(0)]²}, is approximately σ²t/f² = RTt/(3πaηN).
These final conclusions are consistent with Einstein’s model, although here they arise as
an approximation to the model obtained from equation (19). Since it is primarily the
conclusions that have observational consequences, there are essentially no new
experimental implications. However, the analysis arising directly out of Newton’s
second law, which yields a process having a well-defined velocity at each point, seems
more satisfactory theoretically than Einstein’s original model.
Stochastic processes
A stochastic process is a family of random variables X(t) indexed by a parameter t,
which usually takes values in the discrete set Τ = {0, 1, 2,…} or the continuous set Τ = [0,
+∞). In many cases t represents time, and X(t) is a random variable observed at time t.
Examples are the Poisson process, the Brownian motion process, and the Ornstein-
Uhlenbeck process described in the preceding section. Considered as a totality, the
family of random variables {X(t), t ∊ Τ} constitutes a “random function.”
Stationary processes
The mathematical theory of stochastic processes attempts to define classes of processes
for which a unified theory can be developed. The most important classes are stationary
processes and Markov processes. A stochastic process is called stationary if, for
all n, t1 < t2 <⋯< tn, and h > 0, the joint distribution of X(t1 + h),…, X(tn + h) does not
depend on h. This means that in effect there is no origin on the time axis; the stochastic
behaviour of a stationary process is the same no matter when the process is observed. A
sequence of independent identically distributed random variables is an example of a
stationary process. A rather different example is defined as follows: U(0) is uniformly
distributed on [0, 1]; for each t = 1, 2,…, U(t) = 2U(t − 1) if U(t − 1) ≤ 1/2, and U(t) =
2U(t − 1) − 1 if U(t − 1) > 1/2. The marginal distributions of U(t), t = 0, 1,… are
uniformly distributed on [0, 1], but, in contrast to the case of independent identically
distributed random variables, the entire sequence can be predicted from knowledge

of U(0). A third example of a stationary process is

X(t) = Σ ck[Yk cos(θkt) + Zk sin(θkt)],

where the Ys and Zs are independent normally distributed random variables with mean 0 and
unit variance, and the cs and θs are constants. Processes of this kind can be useful in
modeling seasonal or approximately periodic phenomena.
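
As a numerical illustration of the harmonic form given above (Python with NumPy; the constants ck and θk are arbitrary), the variance of X(t) across independent sample paths is the same at every t, as stationarity requires:

    import numpy as np

    rng = np.random.default_rng(4)
    c = np.array([1.0, 0.5])                # the cs (arbitrary constants)
    theta = np.array([0.7, 2.3])            # the thetas (arbitrary constants)
    t = np.arange(0.0, 50.0, 0.1)

    def sample_path(rng):
        y = rng.normal(size=2)              # the Ys: independent Normal(0, 1)
        z = rng.normal(size=2)              # the Zs: independent Normal(0, 1)
        phases = np.outer(theta, t)
        return (c[:, None] * (y[:, None] * np.cos(phases)
                              + z[:, None] * np.sin(phases))).sum(axis=0)

    paths = np.array([sample_path(rng) for _ in range(4000)])
    # Stationarity: Var X(t) = c1**2 + c2**2 = 1.25 at every t
    print(paths.var(axis=0)[:5])
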

A remarkable generalization of the strong law of large numbers is the ergodic theorem:
if X(t), t = 0, 1,… for the discrete case or 0 ≤ t < ∞ for the continuous case, is a stationary
process such that E[X(0)] is finite, then with probability 1 the average

(1/s)[X(0) + X(1) +⋯+ X(s − 1)] if t is discrete, or (1/s)∫₀ˢX(t)dt

if t is continuous, converges to a limit as s → ∞.


In the special case that t is discrete and the Xs are independent and identically
distributed, the strong law of large numbers is also applicable and shows that the limit
must equal E{X(0)}. However, the example that X(0) is an arbitrary random variable
and X(t) ≡ X(0) for all t > 0 shows that this cannot be true in general. The limit does
equal E{X(0)} under an additional rather technical assumption to the effect that there is
no subset of the state space, having probability strictly between 0 and 1, in which the
process can get stuck and never escape. This assumption is not fulfilled by the
example X(t) ≡ X(0) for all t, which gets stuck immediately at its initial value. It is
satisfied by the sequence U(t) defined above, so by the ergodic theorem the average of
these variables converges to 1/2 with probability 1. The ergodic theorem was first
conjectured by the American chemist J. Willard Gibbs in the early 1900s in
the context of statistical mechanics and was proved in a corrected, abstract formulation
by the American mathematician George David Birkhoff in 1931.
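
The doubling-map example can be checked numerically, with one caveat: in binary floating point the map 2u mod 1 collapses to 0 after about 53 steps, so the sketch below (plain Python; the modulus is an arbitrary choice) runs the map exactly on rationals k/q with q odd:

    import random

    random.seed(5)
    # Floating point collapses 2u mod 1 to 0 after ~53 doublings, so run the
    # map exactly on rationals k/q with an odd modulus q (the orbit never dies)
    q = 2**61 - 1
    k = random.randrange(1, q)         # U(0) = k/q, close to Uniform[0, 1]

    n, total = 100_000, 0
    for _ in range(n):
        total += k                     # accumulate q * U(t)
        k = (2 * k) % q                # U(t+1) = 2U(t) mod 1

    print(total / (n * q))             # time average: near 1/2, as predicted
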

Markovian processes
A stochastic process is called Markovian (after the Russian mathematician Andrey
Andreyevich Markov) if at any time t the conditional probability of an arbitrary future
event given the entire past of the process—i.e., given X(s) for all s ≤ t—equals the
conditional probability of that future event given only X(t). Thus, in order to make a
probabilistic statement about the future behaviour of a Markov process, it is no more
helpful to know the entire history of the process than it is to know only its current state.
The conditional distribution of X(t + h) given X(t) is called the transition probability of
the process. If this conditional distribution does not depend on t, the process is said to
have “stationary” transition probabilities. A Markov process with stationary transition
probabilities may or may not be a stationary process in the sense of the preceding
paragraph. If Y1, Y2,… are independent random variables and X(t) = Y1 +⋯+ Yt, the
stochastic process X(t) is a Markov process. Given X(t) = x, the conditional probability
that X(t + h) belongs to an interval (a, b) is just the probability that Yt + 1 +⋯
+ Yt + h belongs to the translated interval (a − x, b − x); and because of independence this
conditional probability would be the same if the values of X(1),…, X(t − 1) were also
given. If the Ys are identically distributed as well as independent, this transition
probability does not depend on t, and then X(t) is a Markov process
with stationary transition probabilities. Sometimes X(t) is called a random walk, but this
terminology is not completely standard. Since both the Poisson process and Brownian
motion are created from random walks by simple limiting processes, they, too, are
Markov processes with stationary transition probabilities. The Ornstein-Uhlenbeck
process defined as the solution (19) to the stochastic differential equation (18) is also a
Markov process with stationary transition probabilities.

The Ornstein-Uhlenbeck process and many other Markov processes with stationary
transition probabilities behave like stationary processes as t → ∞. Roughly speaking, the
conditional distribution of X(t) given X(0) = x converges as t → ∞ to a distribution,
called the stationary distribution, that does not depend on the starting value X(0) = x.
Moreover, with probability 1, the proportion of time the process spends in any subset of
its state space converges to the stationary probability of that set; and, if X(0) is given the
stationary distribution to begin with, the process becomes a stationary process. The
Ornstein-Uhlenbeck process defined in equation (19) is stationary if V(0) has a normal
distribution with mean 0 and variance σ²/(2mf).

At another extreme are absorbing processes. An example is the Markov process
describing Peter’s fortune during the game of gambler’s ruin. The process is absorbed
whenever either Peter or Paul is ruined. Questions of interest involve the probability of
being absorbed in one state rather than another and the distribution of the time until
absorption occurs. Some additional examples of stochastic processes follow.
The Ehrenfest model of diffusion
The Ehrenfest model of diffusion (named after the Austrian Dutch physicist Paul
Ehrenfest) was proposed in the early 1900s in order to illuminate the statistical
interpretation of the second law of thermodynamics, that the entropy of a closed system
can only increase. Suppose N molecules of a gas are in a rectangular container divided
into two equal parts by a permeable membrane. The state of the system at time t is X(t),
the number of molecules on the left-hand side of the membrane. At each time t = 1, 2,…
a molecule is chosen at random (i.e., each molecule has probability 1/N to be chosen)
and is moved from its present location to the other side of the membrane. Hence, the
system evolves according to the transition probability p(i, j) = P{X(t + 1) = j|X(t) = i},
where

p(i, i − 1) = i/N, p(i, i + 1) = (N − i)/N,

and p(i, j) = 0 for all other j.
The long run behaviour of the Ehrenfest process can be inferred from general theorems
about Markov processes in discrete time with discrete state space and stationary
transition probabilities. Let T(j) denote the first time t ≥ 1 such that X(t) = j and set T(j)
= ∞ if X(t) ≠ j for all t. Assume that for all states i and j it is possible for the process to go
from i to j in some number of steps—i.e., P{T(j) < ∞|X(0) = i} > 0. If the equations

Q(j) = Σi Q(i)p(i, j) (20)

have a solution Q(j) that is a probability distribution—i.e., Q(j) ≥ 0, and ΣQ(j) = 1—then
that solution is unique and is the stationary
distribution of the process. Moreover, Q(j) = 1/E{T(j)|X(0) = j}; and, for any initial
state j, the proportion of time t that X(t) = i converges with probability 1 to Q(i).

For the special case of the Ehrenfest process, assume that N is large and X(0) = 0.
According to the deterministic prediction of the second law of thermodynamics,
the entropy of this system can only increase, which means that X(t) will steadily increase
until half the molecules are on each side of the membrane. Indeed, according to the
stochastic model described above, there is overwhelming probability that X(t) does
increase initially. However, because of random fluctuations, the system occasionally
moves from configurations having large entropy to those of smaller entropy and
eventually even returns to its starting state, in defiance of the second law of
thermodynamics.

The accepted resolution of this contradiction is that the length of time such a system
must operate in order that an observable decrease of entropy may occur is so
enormously long that a decrease could never be verified experimentally. To consider
only the most extreme case, let T denote the first time t ≥ 1 at which X(t) = 0—i.e., the
time of first return to the starting configuration having all molecules on the right-hand
side of the membrane. It can be verified by substitution in equation (20) that the
stationary distribution of the Ehrenfest model is the binomial distribution

Q(j) = {N!/[j!(N − j)!]}(1/2)^N, j = 0, 1,…, N,

and hence E(T) = 1/Q(0) = 2^N. For example, if N is only 100 and transitions occur
at the rate of 10^6 per second, E(T) is of the order of 10^15 years. Hence, on the macroscopic
scale, on which experimental measurements can be made, the second law of
thermodynamics holds.
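
A direct simulation of the Ehrenfest chain (Python with NumPy; N and the run length are arbitrary) shows the occupation frequencies settling on the binomial stationary distribution:

    import numpy as np
    from math import comb

    rng = np.random.default_rng(6)
    N, steps = 20, 200_000
    x = 0                                   # X(0) = 0: all molecules on the right

    counts = np.zeros(N + 1)
    for _ in range(steps):
        # The molecule chosen at random is on the left with probability x/N
        if rng.random() < x / N:
            x -= 1                          # it moves left -> right
        else:
            x += 1                          # it moves right -> left
        counts[x] += 1

    empirical = counts / steps
    stationary = np.array([comb(N, j) / 2**N for j in range(N + 1)])
    print(np.abs(empirical - stationary).max())    # small
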
The symmetric random walk
A Markov process that behaves in quite different and surprising ways is the symmetric
random walk. A particle occupies a point with integer coordinates in d-
dimensional Euclidean space. At each time t = 1, 2,… it moves from its present location
to one of its 2d nearest neighbours with equal probabilities 1/(2d), independently of its
past moves. For d = 1 this corresponds to moving a step to the right or left according to
the outcome of tossing a fair coin. It may be shown that for d = 1 or 2 the particle returns
with probability 1 to its initial position and hence to every possible position infinitely
many times, if the random walk continues indefinitely. In three or more dimensions, at
any time t the number of possible steps that increase the distance of the particle from
the origin is much larger than the number decreasing the distance, with the result that
the particle eventually moves away from the origin and never returns. Even in one or
two dimensions, although the particle eventually returns to its initial position, the
expected waiting time until it returns is infinite, there is no stationary distribution, and
the proportion of time the particle spends in any state converges to 0!
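
The contrast between low and high dimensions is easy to see by simulation (Python with NumPy; step and run counts are arbitrary, and a finite horizon only approximates “ever returns”):

    import numpy as np

    rng = np.random.default_rng(7)

    def returns_to_origin(d, max_steps, rng):
        # One of the 2d nearest neighbours is chosen with probability 1/(2d)
        pos = np.zeros(d, dtype=int)
        for _ in range(max_steps):
            axis = rng.integers(d)
            pos[axis] += rng.choice([-1, 1])
            if not pos.any():
                return True
        return False

    runs = 500
    for d in (1, 2, 3):
        hits = sum(returns_to_origin(d, 2000, rng) for _ in range(runs))
        print(d, hits / runs)   # high for d = 1 and 2, markedly lower for d = 3
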
Queuing models
The simplest service system is a single-server queue, where customers arrive, wait their
turn, are served by a single server, and depart. Related stochastic processes are the
waiting time of the nth customer and the number of customers in the queue at time t.
For example, suppose that customers arrive at times 0 = T0 < T1 < T2 <⋯ and wait in
a queue until their turn. Let Vn denote the service time required by the nth customer, n =
0, 1, 2,…, and set Un = Tn − Tn − 1. The waiting time, Wn, of the nth customer satisfies the
relation W0 = 0 and, for n ≥ 1, Wn = max(0, Wn − 1 + Vn − 1 − Un). To see this, observe that
the nth customer must wait for the same length of time as the (n − 1)th customer plus
the service time of the (n − 1)th customer minus the time between the arrival of the (n −
1)th and nth customer, during which the (n − 1)th customer is already waiting but
the nth customer is not. An exception occurs if this quantity is negative, and then the
waiting time of the nth customer is 0. Various assumptions can be made about the input
and service mechanisms. One possibility is that customers arrive according to a Poisson
process and their service times are independent, identically distributed random
variables that are also independent of the arrival process. Then, in terms of
Yn = Vn − 1 − Un, which are independent, identically distributed random variables, the
recursive relation defining Wn becomes Wn = max(0, Wn − 1 + Yn). This process is a
Markov process. It is often called a random walk with reflecting barrier at 0, because it
behaves like a random walk whenever it is positive and is pushed up to be equal to 0 whenever it tries
to become negative. Quantities of interest are the mean and variance of the waiting time
of the nth customer and, since these are very difficult to determine exactly, the mean
and variance of the stationary distribution. More realistic queuing models try to
accommodate systems with several servers and different classes of customers, who are
served according to certain priorities. In most cases it is impossible to give a
mathematical analysis of the system, which must be simulated on a computer in order to
obtain numerical results. The insights gained from theoretical analysis of simple cases
can be helpful in performing these simulations. Queuing theory had its origins in
attempts to understand traffic in telephone systems. Present-day research is stimulated,
among other things, by problems associated with multiple-user computer systems.
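
The waiting-time recursion is straightforward to simulate. The sketch below (Python with NumPy) uses Poisson arrivals and exponential service times; the closed-form mean wait used as a check is the standard M/M/1 formula, which is not derived in this article:

    import numpy as np

    rng = np.random.default_rng(8)
    lam, mu = 0.8, 1.0                  # arrival rate < service rate (stable)
    n = 200_000

    u = rng.exponential(1 / lam, n)     # U_n: interarrival times (Poisson input)
    v = rng.exponential(1 / mu, n)      # V_n: service times

    # W_n = max(0, W_{n-1} + V_{n-1} - U_n): a random walk reflected at 0
    w = np.zeros(n)
    for i in range(1, n):
        w[i] = max(0.0, w[i - 1] + v[i - 1] - u[i])

    # Check against the standard M/M/1 stationary mean wait lam/(mu*(mu - lam))
    print(w[n // 2:].mean(), lam / (mu * (mu - lam)))
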

Reflecting barriers arise in other problems as well. For example, if B(t) denotes
Brownian motion, then X(t) = B(t) + ct is called Brownian motion with
drift c. This model is appropriate for Brownian motion of a particle under the influence
of a constant force field such as gravity. One can add a reflecting barrier at 0 to account
for reflections of the Brownian particle off the bottom of its container. The result is a
model for sedimentation, which for c < 0 in the steady state as t → ∞ gives a statistical
derivation of the law of pressure as a function of depth in an isothermal atmosphere.
Just as ordinary Brownian motion can be obtained as the limit of a rescaled random
walk as the number of steps becomes very large and the size of individual steps small,
Brownian motion with a reflecting barrier at 0 can be obtained as the limit of a rescaled
random walk with reflection at 0. In this way, Brownian motion with a reflecting barrier
plays a role in the analysis of queuing systems. In fact, in modern probability theory one
of the most important uses of Brownian motion and other diffusion processes is as
approximations to more complicated stochastic processes. The exact mathematical
description of these approximations gives remarkable generalizations of the central limit
theorem from sequences of random variables to sequences of random functions.
Insurance risk theory
The ruin problem of insurance risk theory is closely related to the problem of gambler’s
ruin described earlier and, rather surprisingly, to the single-server queue as well.
Suppose the amount of capital at time t in one portfolio of an insurance company is
denoted by X(t). Initially X(0) = x > 0. During each unit of time, the portfolio receives
an amount c > 0 in premiums. At random times claims are made against
the insurance company, which must pay the amount Vn > 0 to settle the nth claim.

If N(t) denotes the number of claims made in time t, then

X(t) = x + ct − (V1 +⋯+ VN(t)),

provided that this quantity has been positive at all earlier times s < t. At the first time X(t)
becomes negative, however, the portfolio is ruined. A principal problem of insurance
risk theory is to find the probability of ultimate ruin. If one imagines that the problem of
gambler’s ruin is modified so that Peter’s opponent has an infinite amount of capital and
can never be ruined, then the probability that Peter is ultimately ruined is similar to the
ruin probability of insurance risk theory. In fact, with the artificial assumptions that
(i) c = 1, (ii) time proceeds by discrete units, say t = 1, 2,…, (iii) Vn is identically equal to 2
for all n, and (iv) at each time t a claim occurs with probability p or does not occur with
probability q independently of what occurs at other times, then the process X(t) is the
same stochastic process as Peter’s fortune, which is absorbed if it ever reaches the state
0. The probability of Peter’s ultimate ruin against an infinitely rich adversary is easily
obtained by taking the limit of equation (6) as m → ∞. The answer is (q/p)^x if p > q—i.e.,
the game is favourable to Peter—and 1 if p ≤ q. More interesting assumptions for the
insurance risk problem are that the number of claims N(t) is a Poisson process and the
sizes of the claims V1, V2,… are independent, identically distributed positive random
variables. Rather surprisingly, under these assumptions the probability of ultimate ruin
as a function of the initial fortune x is exactly the same as the stationary probability that
the waiting time in the single-server queue with Poisson input exceeds x. Unfortunately,
neither problem is easy to solve exactly, although there is a very good approximate
solution originally derived by the Swedish mathematician Harald Cramér.
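
A Monte Carlo sketch of the ruin probability (Python with NumPy; exponential claim sizes and all parameter values are assumptions made for illustration) can be compared with the Cramér-type closed form, which happens to be exact for exponential claims:

    import numpy as np

    rng = np.random.default_rng(9)

    def ruined(x, c, lam, mean_claim, horizon, rng):
        # Claims arrive as a Poisson process of rate lam; premiums accrue at rate c
        t, capital = 0.0, x
        while t < horizon:
            gap = rng.exponential(1 / lam)
            t += gap
            capital += c * gap
            capital -= rng.exponential(mean_claim)   # settle the next claim
            if capital < 0:
                return True                          # the portfolio is ruined
        return False

    x, c, lam, mean_claim = 10.0, 1.2, 1.0, 1.0      # invented; c > lam * mean_claim
    runs = 2000
    est = sum(ruined(x, c, lam, mean_claim, 1000.0, rng) for _ in range(runs)) / runs

    # For exponential claims the ruin probability is (1/(1+theta)) * exp(-R*x),
    # with premium loading theta and adjustment coefficient R as below
    theta = c / (lam * mean_claim) - 1
    R = theta / ((1 + theta) * mean_claim)
    print(est, np.exp(-R * x) / (1 + theta))
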
Martingale theory
As a final example, it seems appropriate to mention one of the dominant ideas of
modern probability theory, which at the same time springs directly from the relation of
probability to games of chance. Suppose that X1, X2,… is any stochastic process and, for
each n = 0, 1,…, fn = fn(X1,…, Xn) is a (Borel-measurable) function of the indicated
observations. The new stochastic process fn is called a martingale if
E(fn|X1,…, Xn − 1) = fn − 1 for every value of n > 0 and all values of X1,…, Xn − 1. If
the Xs are outcomes in successive trials of a game of chance and fn is the fortune of a gambler after the nth
trial, then the martingale condition says that the game is absolutely fair in the sense
that, no matter what the past history of the game, the gambler’s conditional expected
fortune after one more trial is exactly equal to his present fortune. For example,
let X0 = x, and for n ≥ 1 let Xn equal 1 or −1 according as a coin having probability p of
heads and q = 1 − p of tails turns up heads or tails on the nth toss. Let Sn = X0 +⋯+ Xn.
Then fn = Sn − n(p − q) and fn = (q/p)^Sn are martingales. One of the basic results of
martingale theory is that, if the gambler is free to quit the game at any time using any
strategy whatever, provided only that this strategy does not foresee the future, then the
game remains fair. This means that, if N denotes the stopping time at which the
gambler’s strategy tells him to quit the game, so that his final fortune is fN, then

E(fN|f0) = f0. (21)

Strictly speaking, this result is not true without some additional conditions that must be
verified for any particular application. To see how efficiently it works, consider once
again the problem of gambler’s ruin and let N be the first value of n such that Sn = 0
or m; i.e., N denotes the random time at which ruin first occurs and the game ends. In
the case p = 1/2, application of equation (21) to the martingale fn = Sn, together with the
observation that fN = either 0 or m, yields the equalities x = f0 = E(fN|f0 = x) = m[1
− Q(x)], which can be immediately solved to give the answer in equation (6). For p ≠
1/2, one uses the martingale fn = (q/p)^Sn and similar reasoning to obtain

(q/p)^x = Q(x) + [1 − Q(x)](q/p)^m,

from which the first equation in (6) easily follows. The expected duration of the game is
obtained by a similar argument.
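
The optional stopping argument can be checked by simulation (Python with NumPy; x, m, and p are arbitrary): averaging (q/p)^SN over many plays of the game reproduces (q/p)^x:

    import numpy as np

    rng = np.random.default_rng(10)
    x, m, p = 3, 10, 0.45              # initial fortune, target, win probability
    q = 1 - p

    def final_fortune(rng):
        s = x
        while 0 < s < m:               # play until ruin (0) or the target (m)
            s += 1 if rng.random() < p else -1
        return s

    runs = 50_000
    finals = np.array([final_fortune(rng) for _ in range(runs)])

    # Optional stopping for f_n = (q/p)**S_n: E[(q/p)**S_N] = (q/p)**x
    print(((q / p) ** finals).mean(), (q / p) ** x)
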

A particularly beautiful and important result is the martingale convergence theorem,
which implies that a nonnegative martingale converges with probability 1 as n → ∞. This
means that, if a gambler’s successive fortunes form a (nonnegative) martingale, they
cannot continue to fluctuate indefinitely but must approach some limiting value.

Basic martingale theory and many of its applications were developed by the American
mathematician Joseph Leo Doob during the 1940s and ’50s following some earlier
results due to Paul Lévy. Subsequently it has become one of the most powerful tools
available to study stochastic processes.
David O. Siegmund
infinitesimal
infinitesimal, in mathematics, a quantity less than any finite quantity yet not zero.
Even though no such quantity can exist in the real number system, many early attempts
to justify calculus were based on sometimes dubious reasoning about
infinitesimals: derivatives were defined as ultimate ratios of infinitesimals,
and integrals were calculated by summing rectangles of infinitesimal width. As a
result, differential and integral calculus was originally referred to as the infinitesimal
calculus. This terminology gradually disappeared as rigorous concepts
of limit, continuity, and the real numbers were formulated.
